Name: Legand L. Burge, Jr.  Date of Degree: December 27, 1979
Institution: Oklahoma State University  Location: Stillwater, Oklahoma
Title  of  Study:  EFFICIENT  CODING  OF  THE  PREDICTION  RESIDUAL 
Pages  in  Study:  200  Candidate  for  Degree  of  Doctor  of  Philosophy 

Major  Field:  Electrical  Engineering 

Scope and Method of Study: This thesis presents an efficient method of
coding the prediction residual using the technique of sub-band coding
designed at the bit rate of 9600 bits/second. The energy of the
prediction residual is used to distribute the bit allocation by
sub-bands such that the perceptual criteria are preserved. The
perception is enhanced by transitional information within the phoneme
connections of speech by a technique that weights the energy based
on a normalization factor. A three-tier phoneme classification
is derived from an energy study of the phonemes for the prediction
residual. With this it is shown that speech intelligibility is
enhanced in the coding scheme. The prediction residual is compared
with the glottal waveform. In association with these results, a new
technique for pitch extraction is presented using the prediction as
the input signal to calculate pitch. An adequate indication of
coder quality is described using various types of signal-to-noise
ratios.

Findings and Conclusions: The study of the energy in the prediction
residual of the phonemes shows that the prediction residual, rather
than the conventional two-source model, is a suitable excitation
function. It is shown that the energy of the prediction residual divides
the phonemes into classes by phonemic aggregations, namely high
energy, low energy and noise groups. The high energy group includes
the vowels and diphthongs. The plosive, fricative and unvoiced
phonemes compose the noise group. The low energy group is represented
by the glides and nasals. The bit allocation scheme discussed in
this thesis is based on this idea and is shown to enhance the
perceptual aspects of the decoded signal. The normalization factor
introduced further enhances this quality. The sub-band coder is
designed to exhibit good performance in terms of signal-to-noise
ratios for objective measure of quality. The signal-to-noise ratio
is only an indication of quantizer performance and generally must
be supplemented by subjective and perceptual measurements for speech
coding; further work is necessary in this direction.

CHAPTER I


INTRODUCTION 

1.1  Statement  of  the  Problem 

The  structural  unit  of  speech  composition  is  the  speech  sound  called 
the  phoneme.  Its  variations  are  called  allophones.  It  can  be  said  also 
that  phonemes  relate  to  the  linguistic  basis  of  a  language.  However, 
phonemes  are  not  "bricks,"  i.e.,  the  human  has  been  endowed  with  the 
ability  to  communicate  in  a  continuous  mode.  Because  we  speak  in  an 
uninterrupted  fashion  in  order  to  complete  our  thoughts,  the  phonemic 
structure connects itself by transitional cues for the perception of
certain phonemes [1]. It is this transitional information that is needed
for  absolute  discrimination  of  speech  and  speech-like  sounds  [2].  It  is 
the  transitional  information  that  is  needed  for  efficient  excitation  of 
a  speech  synthesizer. 

To synthesize intelligible speech, the perceptual aspects of speech
sounds have to be used. In other words, the ability of humans to
discriminate and differentiate a speech sound with their over-learned senses
must be incorporated into the speech synthesis technique. The speech
synthesis must include perceptual enhancement and the inclusion of
transitional information (that is, frequency shifts). Transitional
information is the loci of frequency determined by the place of
articulation that connects the phonemes. Phonemes are the basic speech
sound elements used to make a word. One can also say that a phoneme is an
idealized structural unit of language which serves to keep words apart.
It is astonishing how the human brain stores rules to keep track of
one's language for communicating. The object of speech synthesis
is to come as close as possible to this occurrence.

+Some of the words related to the science of the speech waveform are
defined in APPENDIX A.

The history of synthetic voice coding originated with H. W.
Dudley in 1939 [3] [4]. The Dudley speech reproduction model consists of
a filter representing the vocal tract resonance characteristics driven by
an artificially synthesized excitation signal. The filter and the
excitation signal parameters are updated periodically. To determine the filter
characteristics, Dudley used the Fourier spectrum of the speech as a
basis. The excitation signal consists of a pulse train for voiced sounds
and random noise for unvoiced sounds. The model that Dudley presented
is essentially the basis of many methods today [5] [6] [7]. Some
of these ideas are discussed below.

A basic model of the speech waveform assumes a linear,
quasi-time-invariant system which responds to a periodic or noiselike
excitation. This linear time-invariant system represents the vocal tract. If
the vocal tract is assumed to be fixed, then the output of the system is
the convolution of the excitation with the vocal tract impulse response
(see Figure 1).
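The model above can be illustrated with a minimal sketch. It assumes a hypothetical single-resonance impulse response `h` standing in for the vocal tract, an 8 kHz sampling rate, and helper names (`synthesize`, `pulse_train`) that are illustrative only and do not come from the thesis.

```python
import numpy as np

def synthesize(excitation, impulse_response):
    """Output of a fixed linear system: excitation convolved with h[n]."""
    return np.convolve(excitation, impulse_response)[: len(excitation)]

def pulse_train(n_samples, period):
    """Idealized voiced excitation: unit impulses every `period` samples."""
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

# Hypothetical single-resonance "vocal tract": a decaying 500 Hz sinusoid.
fs = 8000
n = np.arange(200)
h = np.exp(-n / 40.0) * np.cos(2 * np.pi * 500 * n / fs)

# Voiced sound: periodic excitation (80 samples = 100 Hz pitch at 8 kHz).
voiced = synthesize(pulse_train(800, 80), h)

# Unvoiced sound: noiselike excitation through the same system.
rng = np.random.default_rng(0)
unvoiced = synthesize(rng.standard_normal(800), h)
```

Within the first pitch period the voiced output is simply the impulse response itself; later periods are overlapped copies of it, which is what the convolution expresses.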

Recently, considerable interest has been given to methods of digital
analysis and synthesis of speech assuming the presented model. A method
that has proven to be efficient for encoding the speech wave is linear
prediction [6]. The linear predictive encoder was developed to improve
the channel vocoder voice quality and intelligibility [7]. The difference
between the linear predictive coded (LPC) vocoder and the channel vocoder
is the filter. There are two types of LPC vocoders, pitch-excited and
residual-excited. The difference between the two is how the excitation
signal is characterized for the synthesis filter. In the pitch-excited
LPC vocoder, the model of the vocal tract, with glottal flow and
radiation, is represented by the predictor coefficients. These coefficients
are transmitted together with the information regarding the excitation of
the speech, i.e., pitch, voiced/unvoiced decision and the gain. Much
research has been done toward the pitch-excited LPC vocoder. Two methods
have been developed, the autocorrelation [8] [9] and the covariance [6]
methods. The residual-excited methods can be characterized the same way.
However, instead of using pitch, voiced/unvoiced decision and gain, the
residual is encoded and transmitted. The residual is the difference
between the actual and predicted speech signals. This technique also
carries the name adaptive predictive coding (APC). The channel vocoder, on
the other hand, uses a set of narrowband filters whereas the linear
predictor uses an all-pole digital filter. The linear predictive filter
describes the frequency response of the vocal tract system by the
predictor coefficients. Its function is to decompose the speech into two
waveforms. One waveform represents the parameters that are time-varying
such as predictor coefficients, partial correlation coefficients and
other parameters that represent the formant frequency characteristics.
The other waveform is the prediction residual. Figure 2 describes a
block diagram of the LPC analysis.
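The decomposition into predictor coefficients and a prediction residual can be sketched as follows. This is only an illustration of the autocorrelation method with the Levinson-Durbin recursion, not the analysis procedure of this thesis; the test frame, the predictor order of 10, and the function names are assumptions made for the example.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Coefficients of A(z) = 1 + a1 z^-1 + ... + ap z^-p (Levinson-Durbin)."""
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection (PARCOR-type) coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a

def analysis_filter(s, a):
    """Residual e[n] = s[n] + sum_j a[j] s[n-j] (zero initial state)."""
    return np.convolve(s, a)[: len(s)]

def synthesis_filter(e, a):
    """All-pole inverse of the analysis filter: rebuilds s[n] from e[n]."""
    p = len(a) - 1
    s = np.zeros(len(e))
    for n in range(len(e)):
        past = sum(a[j] * s[n - j] for j in range(1, min(p, n) + 1))
        s[n] = e[n] - past
    return s

# Hypothetical windowed test frame: two sinusoids plus a little noise.
rng = np.random.default_rng(1)
n = np.arange(240)
frame = (np.sin(2 * np.pi * 0.05 * n) + 0.5 * np.sin(2 * np.pi * 0.12 * n)
         + 0.1 * rng.standard_normal(240)) * np.hamming(240)

a = lpc_autocorrelation(frame, order=10)
residual = analysis_filter(frame, a)     # the prediction residual
rebuilt = synthesis_filter(residual, a)  # excitation by the exact residual
```

When the unquantized residual drives the synthesis filter, the input frame is recovered, and the residual carries less energy than the frame it was derived from.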

The prediction residual is the ideal signal for an excitation
function for the linear predictive analysis and synthesis model because it
contains the actual information instead of the pseudo-model, a pulse
train or random noise [10]. In addition, phase information is embedded
in the prediction residual. Furthermore, since the analysis filter is
the inverse of the synthesis filter, the decomposed waveform can be
reconstructed to form the input speech waveform by updating the parameters
[7] [11]. The prediction residual also follows the actual speech
excitation model g(t) in Figure 1 [12]. The function g(t) represents the
glottal pulse, which is also called the glottal volume velocity at the
vocal cords or glottis. In order to ideally model the voice reproduction
system, it is necessary to use a system whose properties are acoustically
similar to the glottis and vocal tract. It is best to model the
excitation signal with a function analogous to the glottal waveform for
input to the vocal tract. It is well known that for nonnasal voiced
speech sounds, the transfer functions have no zeros [5]. For these
particular sounds, the vocal tract filters can be approximated by an
all-pole filter. It is also known that the shape and periodicity of the
glottal excitation are subject to large variations [12]. However, with
the linear predictive model the features of the glottal flow, the vocal
tract, and the radiation, which is the output from the mouth, are included
in a single recursive filter. To separate the glottal flow from the
vocal tract involves a deconvolution. Some authors have avoided this
separation of the source function; however, the artificial excitation used by
them represents a good approximation to the prediction residual only for
unvoiced sounds. Moreover, for voiced sounds, the artificial excitation
could be improved. The prediction residual should be used for the
excitation function, because it has the following characteristics:

1. It is repetitive at the pitch frequency.

2. It has basically a flat amplitude spectrum; however, it includes
details that relate to the suprasegmentals of the individual and of the
spoken words.

3. It includes the noisiness of opening and closing of the glottal
mechanism indicating phase information.

4. It includes the fact that voiced fricatives and stops are a
combination of noise and a repetitive signal.

Noting the speech characteristics in the residual signal, several
authors have investigated the coding aspects of the prediction residual
[13-32]. However, the speech intelligibility aspects, such as the
Articulation Index (AI) [29], have not been used in these studies. The
Articulation Index concept has been used effectively in the sub-band coding
of speech [36]. Sub-band coding, based upon AI, allows for an efficient bit
distribution in coding. This thesis combines all these ideas and presents an
efficient method of coding the prediction residual using the concepts of
sub-band coding. A literature survey related to these areas is presented
in the next section.

One important aspect of coding is bit rate. For certain narrow band
rates, the coding of the prediction residual is not feasible [13]. Also,
it has been shown that 9,600 bits/second is feasible for transmission of
residual and filter parameters, and is practical over voice grade lines
[35]. In the future, lower data rates have to be used for cost
effectiveness. At present, rates below 6,000 bits/second yield speech quality
of a synthetic nature. Rates between 6,000 bits/second and 16,000
bits/second demonstrate good communication quality. Studies have shown, and
present operating equipment demonstrates, that transmission rates of
16,000 bits/second and above yield toll telephone quality. The thrust of the
governmental community for designing voice switch networks has recently
been toward the 9,600 bits/second rate. At this rate, the communicators
can comprehend the language spoken; however, there is some drop-off in
speaker recognition, though not as drastic as at rates closer to 6,000
bits/second. With the advent of microprocessing systems, more sophisticated
algorithms can be implemented with small monetary investments. This
thesis presents the coding and decoding of the residual signal using
sub-band coding at a data rate of 9,600 bits/second.

1.2  Review  of  the  Literature 

Predictive systems related to speech have evolved through the years.
A brief survey of these systems is presented below. In earlier studies
of predictive coding systems with applications to speech signals, the
linear predictors were limited to fixed coefficients in an interval [17].
In more recent studies, it was found that since the speech signal has
nonstationary properties, the linear predictor does not efficiently predict
the signal at each interval. In work by Atal and Schroeder [6], an
adaptive predictive system took into account the quasi-periodicity of speech
signals. In addition to being the classic forerunner for adaptive
predictive coding (APC) of speech signals, this is a more elaborate predictor
than the one with fixed coefficients and is suited to the characteristics
of speech sounds. Basically, the residual signal along with the predictor
provides sufficient information for the receiver to regenerate the input.
In this, pitch is determined from the residual signal. Atal and Schroeder
[22] have examined predictive coding of speech signals recently. They
have shown that speech quality can be improved by masking quantizer noise
with the speech signal. Atal and Hanauer [5] described an efficient
encoding of the speech wave by representing it in terms of time-varying
parameters related to a transfer function of the vocal tract and by
modeling the excitation.

In work by Dunn [13], the linear predictive coded residual signal was
generated by a feed-forward linear predictive coding (LPC) analyzer and
encoded using delta modulation (DM). The signal was transmitted at a bit
rate of 9,600 bits/second. Gibson, Jones, and Melsa [14] have introduced
a method called sequential adaptive prediction which utilized differential
pulse code modulation (DPCM) with an adaptive quantizer and an adaptive
predictor using Kalman filtering. This work was improved upon by Cohn and
Melsa [15] using adaptive differential pulse code modulation (ADPCM) for
encoding the prediction residual. A method using the Kalman filter for
the adaptive predictive encoder was introduced by Goldberg and others
[16]. This system was a real-time APC that was implemented on a
minicomputer. An adaptive residual coder using an adaptive predictor, adaptive
quantizer, and a variable length coder was studied by Qureshi and Forney
[18]. In these studies, a class of speech digitization algorithms is
described for use at bit rates of 9,600 to 16,000 bits/second. These
systems involve an adaptive predictor, an adaptive quantizer, and a variable
length coder. This is a practical version of a residual encoder
previously studied by Melsa and others [14]. Most recently, the method of
variable length coding of the prediction residual was studied by Berouti and
Makhoul [19]. This system of APC uses a noise spectral shaping filter to
solve the granular noise quantization problem and an indefinite quantizer
to solve the overload quantizing problem.

A voice-excited predictive coder (VEPC) by Esteban and others [20]
uses a baseband excitation of the residual and splitband coding by signal
decimation/interpolation. Furthermore, quadrature mirror filters are
implemented in order that the aliasing properties could be taken
advantage of in the synthesizer.

The most recent work by Cohn and Melsa [21] [23] involves the
implementation of a speech coding algorithm for digital transmission of speech
at 9,600 bits/second using a sequential, adaptive linear predictive coder,
an adaptive source coder, and a multipath tree-searching algorithm to
generate quality speech. This is an extension of the previous work done on
a residual encoder which was an improved ADPCM system for speech
digitization. Chang [24] has extended this work and incorporated a noise
resistant code for transmission.

In work by Magill and others [25], a feed-forward LPC analyzer was
used  with  an  encoding  method  of  Adaptive  Delta  Modulation  (ADM)  and  an 
experimental  method  of  encoding  the  residual  by  DPCM.  This  is  referred 
to  as  a  residual  excited  linear  predictive  (RELP)  vocoder.  It  combines 
the  advantages  of  linear  predictive  coding  and  voice-excited  vocoding. 

Recently,  Dankberg  and  Wong  [26]  have  implemented  a  new  version  of 
the  RELP  vocoder.  Their  results  have  included  a  development  of  a  pitch 
predicted  ADPCM  residual  encoder  and  a  harmonic  generator.  Viswanathan 
and  others  [27]  considered  the  use  of  voice-excited  linear  predictive 
(VELP) and RELP coders for speech. They have studied in detail the
various aspects of these coders and have attempted to maximize speech
quality as a result. They also studied the advantages and disadvantages of
baseband  residual  transmission  and  baseband  speech  transmission. 

In recent work, Kang [28] studied the development of a narrowband
voice digitizer that improves speech quality, intelligibility and
reliability. The principle of LPC is used in implementing the lattice filter
for the analysis and synthesis. Itakura and Saito [9] [30] have used
the lattice method for LPC analysis of speech. The thrust has been for
improved quantization of partial correlation (PARCOR) coefficients.
Makhoul [31] has presented a class of stable and efficient lattice
methods for linear prediction of speech. In this, an in-depth study is made
of PARCOR coefficients. If the all-pole function is stable, then the
lattice obtained from this is stable; furthermore, since the PARCOR
coefficients are bounded, stability is guaranteed and an efficient
quantization method can be used.

In  work  by  Flanagan  [32],  it  is  shown  that  the  residual  approximates 
the  glottal  waveform.  In  any  excitation  system,  the  closer  one  can 
approximate the physical model, the better response one gets from the
system.  Flanagan's  work  enhances  this  concept  to  use  the  residual  wave¬ 
form  as  the  excitation  to  the  speech  synthesizer. 

Rabiner  and  others  [33]  have  studied  the  LPC  error  signal.  The  work 
investigated the variation of the prediction error as a function of
position in an analysis frame within a single stationary speech segment. The
error  signal  has  the  frequency  range  of  the  actual  speech. 

The  work  of  Goodman  [34]  found  the  analog  signal  can  be  divided  into 
several  nonoverlapping  frequency  bands.  Each  band  can  be  sampled  and 
quantized independently. The result is an improvement in encoding
efficiency over straight sampling and quantizing of signals that are spectrum
peaked. Crochiere and others [36] [37] have applied this to speech
signals in the digital domain. This is referred to as sub-band coding (SBC).
This  approach  provides  a  means  of  controlling  and  reducing  quantization 
noise  in  the  coding. 
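The principle of sub-band coding can be sketched as follows. This is a deliberately idealized illustration, not the coder of this thesis: the band edges, the bit allocation, and the test signal are hypothetical, the band split uses FFT masks in place of a practical filter bank, and the decimation of each band that a real coder performs is omitted.

```python
import numpy as np

def split_bands(x, edges, fs):
    """Split x into nonoverlapping bands using FFT masks (idealized filters)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # The top band includes the Nyquist frequency so the bands sum to x.
        top = (f <= hi) if hi >= fs / 2 else (f < hi)
        bands.append(np.fft.irfft(X * ((f >= lo) & top), n=len(x)))
    return bands

def quantize(x, bits):
    """Uniform mid-rise quantizer scaled to the band's own peak amplitude."""
    peak = np.max(np.abs(x))
    if peak == 0.0 or bits == 0:
        return np.zeros_like(x)
    step = 2.0 * peak / (2 ** bits)
    q = (np.floor(x / step) + 0.5) * step
    return np.clip(q, -peak + step / 2, peak - step / 2)

# Hypothetical test signal, band edges (Hz), and bits per sample per band.
fs = 8000
rng = np.random.default_rng(2)
t = np.arange(1024) / fs
x = (np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)
     + 0.05 * rng.standard_normal(1024))
edges = [0, 500, 1000, 2000, 4000]
alloc = [4, 3, 2, 2]

bands = split_bands(x, edges, fs)
decoded = sum(quantize(b, bits) for b, bits in zip(bands, alloc))
```

Because each band is scaled and quantized on its own, the quantization error in a band is bounded by that band's step size, so noise in a weak band is not set by the amplitude of a strong band elsewhere in the spectrum.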

A pilot study of speech waveform coding techniques was conducted by
Tribolet and others [38]. The study compared subjective ratings to the
various quality (objective) measures for speech waveform coders.
Tribolet and others examined four different speech waveform coder algorithms
for low-bit rate applications, and studied these relationships for
overall objective and subjective ratings for quality. The algorithms were:
adaptive differential PCM with a fixed predictor (ADPCM-F), sub-band
coding (SBC), ADPCM with a variable predictor (ADPCM-V) and adaptive
transform coding (ATC). The transmission rates studied were 24,000, 16,000,
and 9,600 bits/second. The objective measures used were a conventional
signal-to-noise ratio, frequency weighted signal-to-noise ratio, log
likelihood ratio, and an articulatory bandwidth measure. The results of
the study were that if complexity/cost was of no concern, then ATC was the
most attractive of the group of coders. However, if complexity/cost was a
concern, then SBC was an attractive choice. ADPCM-F had the poorest
quality for its complexity; ADPCM-V was the most costly for its quality. The
transform coding and the sub-band coding will be explained in detail in
Chapter II.
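For reference, the conventional signal-to-noise ratio named among the objective measures can be computed as sketched below. The frame-averaged (segmental) variant is included only as a commonly used alternative; it is an assumption of this example and is not one of the measures listed in the study above, and the frame length of 160 samples is hypothetical.

```python
import numpy as np

def snr_db(reference, decoded):
    """Conventional SNR: total signal energy over total coding-noise energy."""
    noise = reference - decoded
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def segmental_snr_db(reference, decoded, frame_len=160):
    """Mean of per-frame SNR values (frame_len is a hypothetical choice)."""
    n_frames = len(reference) // frame_len
    values = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        values.append(snr_db(reference[seg], decoded[seg]))
    return float(np.mean(values))

# Toy check: a decoded signal with a uniform 10 percent amplitude error.
reference = np.sin(2 * np.pi * 0.01 * np.arange(800) + 0.3)
decoded = 0.9 * reference
```

In this toy case the noise is one tenth of the signal amplitude everywhere, so both measures evaluate to 20 dB; on real coded speech the two generally differ because low-level frames weigh more heavily in the frame average.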

In the work by Barabell and Crochiere [39] a new design of the
sub-band coder has been implemented for low-bit rate coding of speech. This
study applied quadrature filters to SBC. This method has also employed
pitch prediction within the sub-bands. Crochiere [40] has implemented a
novel approach for pitch extraction in the SBC. The method uses digital
linear phase shifters based on a bandpass interpolation scheme to achieve
the non-integer delays necessary in the feedback loop for the pitch
predictors. It uses the fractional sample delay in the pitch loop and
permits the processing of the pitch prediction in each sub-band to be
performed at the sub-band sampling rate, which contributes to the
efficiency of the algorithm.

The pitch detection algorithms mentioned above have one basic goal:
to make a voiced or unvoiced decision and, during periods of voiced
sounds, to estimate the pitch period.

There are three categories of pitch detectors. First,
there is a group that uses time-domain properties of speech signals.
These pitch detectors operate directly on the speech waveform in order to
estimate the pitch period. The measurements that are usually taken are
minimum and maximum amplitude, zero-crossing and autocorrelation
measurements. With these detectors, it is assumed the formant structure has been
minimized by preprocessing the speech. A second category of pitch
detection algorithms uses frequency-domain properties of speech signals. A
periodic signal in the time-domain will consist of a series of impulses
in the frequency-domain located at the fundamental frequency and its
harmonics. Therefore, one can make measurements in the frequency domain to
determine the pitch period. The final group combines both time and
frequency-domain concepts of the speech signals in order to determine pitch
period. A typical technique of this kind flattens the signal spectrum with
frequency-domain techniques and subsequently uses autocorrelation
measures to estimate the pitch period. These are called hybrid techniques.
Previously published work on pitch detection algorithms and related
topics is discussed below.

There are several documented pitch extraction methods that have been
published recently. In earlier methods, analysis of the speech time
waveform was attempted by visual inspection of spectrograms, which involved
the manual determination of pitch [41]. At that time the authors noted
the requirement for an automatic scheme of some kind. Pinson [42] used
the method of Mathews, Miller, and David [41] to estimate a time-domain
synchronous pitch which in turn was used to determine frequencies and
bandwidths of vowel formants.

Sondhi  [43]  introduced  three  methods  for  finding  the  pitch  period. 
The first method spectrum-flattens the signal and corrects the phase to
synchronize harmonics. A second method by Sondhi also flattens the
spectrum but adds an autocorrelation to determine pitch. The third method
center-clips the speech signal and uses autocorrelation for determination
of pitch. Using the method by Sondhi, a real-time digital hardware pitch
detector was implemented by Dubnowski, Schafer, and Rabiner [44].
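The center-clipping autocorrelation approach can be sketched as follows. The clipping level (30 percent of the frame peak), the 60 to 400 Hz search range, and the synthetic test signal are assumptions chosen for the example and are not taken from the cited work.

```python
import numpy as np

def center_clip(x, fraction=0.3):
    """Zero samples below the clip level; shift the rest toward zero."""
    c = fraction * np.max(np.abs(x))
    return np.where(x > c, x - c, np.where(x < -c, x + c, 0.0))

def autocorr_pitch(x, fs, f_min=60.0, f_max=400.0):
    """Pitch (Hz) from the largest autocorrelation peak in the lag range."""
    y = center_clip(x)
    r = np.correlate(y, y, mode="full")[len(y) - 1:]   # lags 0, 1, 2, ...
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag

# Hypothetical voiced test signal: 100 Hz pulse train through one resonance.
fs = 8000
n = np.arange(200)
h = np.exp(-n / 30.0) * np.cos(2 * np.pi * 700 * n / fs)
pulses = np.zeros(800)
pulses[::80] = 1.0
x = np.convolve(pulses, h)[320:800]   # steady-state portion, 60 ms

pitch = autocorr_pitch(x, fs)
```

Center clipping removes most of the low-level formant oscillation inside each period, so the autocorrelation peak at the pitch lag is less likely to be confused with peaks at the formant period.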

There are also methods that make use of the power spectrum in the
determination of the pitch. One such method is called cepstrum pitch
determination. The cepstrum is defined as the power spectrum of the
logarithm of the power spectrum, or mathematically expressed, the cepstrum,
Q(t) [45] [46], is

$$Q(t) = \left[ \int_{0}^{\infty} \log |F(\omega)|^{2} \cos(\omega t)\, d\omega \right]^{2} \qquad (1.1)$$

where f(t) is the speech signal, $\omega$ is the frequency in radians, and

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt \qquad (1.2)$$
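A discrete-time counterpart of Equation (1.1) can be sketched as below. The inverse FFT of the real, even log power spectrum plays the role of the cosine transform; the Hamming window, the FFT length of 2048, the 60 to 400 Hz search range, and the synthetic test signal are assumptions of this example.

```python
import numpy as np

def cepstrum_pitch(x, fs, f_min=60.0, f_max=400.0, n_fft=2048):
    """Pitch (Hz) from the peak of the squared cepstrum of one frame."""
    w = x * np.hamming(len(x))
    power = np.abs(np.fft.rfft(w, n=n_fft)) ** 2
    log_power = np.log(power + 1e-12)        # small floor avoids log(0)
    ceps = np.fft.irfft(log_power) ** 2      # index = quefrency in samples
    q_min = int(fs / f_max)
    q_max = int(fs / f_min)
    q = q_min + np.argmax(ceps[q_min:q_max + 1])
    return fs / q

# Hypothetical voiced test signal: 100 Hz pulse train through one resonance.
fs = 8000
n = np.arange(200)
h = np.exp(-n / 30.0) * np.cos(2 * np.pi * 700 * n / fs)
pulses = np.zeros(800)
pulses[::80] = 1.0
x = np.convolve(pulses, h)[320:800]   # steady-state portion, 60 ms

pitch = cepstrum_pitch(x, fs)
```

The logarithm turns the product of the excitation and vocal tract spectra into a sum, so the harmonic ripple of the excitation appears as an isolated peak at the pitch period, well separated from the slowly varying spectral envelope at low quefrency.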

More recently, using digital inverse filtering techniques, Markel
has introduced a method for estimating the fundamental frequency of voiced
speech using time-domain analysis. This method has been referred to as the
simplified inverse filter tracking (SIFT) algorithm [47]. The pitch
period is estimated by an interpolation of the autocorrelation function in
the neighborhood of its peak.


Another recent algorithm that determines the fundamental frequency
of sampled speech is implemented by segmenting the signal into pitch
periods. This is done by identifying the beginning of each pitch period.
This algorithm is called the data reduction pitch detector by Miller [48].
To obtain the appropriate identity of the beginning of the pitch period,
the method detects the cycles of the waveform based on intervals between
major zero crossings. The rest of the algorithm determines principal
cycles, which correspond to true pitch periods.

In  work  presented  by  Gold  [49],  it  is  assumed  that  pitch  extraction 
could  be  obtained  by  a  visual  inspection  of  the  speech  wave  and  is  the 
best  obtainable.  The  computer  program  contains  essentially  four  sections. 
First, a voiced/unvoiced decision is made and the two portions are
separated. Each voiced portion is labeled as relative maximum, then the peak
detector  is  compiled.  The  third  decision  is  to  determine  the  spacing; 
this  in  turn  determines  which  samples  will  be  called  pitch  peaks.  Finally, 
a  procedure  is  necessary  to  eliminate  spurious  peaks  and  add  into  the 
speech  missing  pitch  peaks.  The  program  is  implemented  such  that  editing 
can  make  the  best  pitch  selection. 

The work of Gold and Rabiner [50] using parallel processing for
estimating pitch is a modified version of Gold [49]. A series of measurements
is made to find the peaks and valleys of the signal. There are six
cases used to determine this. Each is followed to determine if the sample
will be an impulse or zero. The rules are:

1.  An  impulse  equal  to  the  peak  of  the  signal  occurs  at  the  point  of 
each  peak  in  time. 

2. An impulse equal to the difference between the signal's present
peak amplitude and the preceding valley amplitude occurs at the point of
each peak in time.


3.  An  impulse  equal  to  the  difference  between  the  signal  present  peak 
and  the  past  peak  amplitude  occurs  at  the  point  of  each  peak  in  time.  (If 
the  difference  is  negative,  then  it  is  set  to  zero.) 

4.  An  impulse  equal  to  the  negative  of  the  peak  of  the  signal  occurs 
at  each  negative  peak  in  time. 

5. An impulse equal to the negative of the peak at each negative
peak plus the peak of the preceding positive peak occurs at each negative
peak in time.

6.  An  impulse  equal  to  the  negative  of  the  peak  at  each  negative 
peak,  plus  the  negative  of  the  preceding  local  minimum  occurs  at  each 
negative peak. (If this difference is negative, then the impulse is set
to zero.)

From  this  technique  six  estimates  are  formed.  These  estimates  are 
combined  with  the  two  most  recent  estimates  for  each  of  the  six  pitch 
detectors.  The  values  are  then  compared  within  an  acceptable  tolerance; 
the  decision  is  made  for  the  most  occurrences.  This  value  is  declared 
the  pitch  at  that  time.  An  unvoiced  decision  is  made  when  there  is  an 
inconsistency  between  the  comparisons  for  the  pitch  period. 

Another method by Atal [51] is based upon LPC. This detector begins with a voiced/unvoiced decision. When the speech is classified as voiced, it is low-pass filtered and then decimated five to one. The method uses a 41-pole LPC analysis on 40 ms of frame data to generate the speech harmonics. Then, a Newton transformation is used to spectrally flatten the speech. A peak picker determines the pitch period at the five-to-one decimated rate. Then, the signal is interpolated and a higher resolution is used to obtain the pitch period.

The average magnitude difference function (AMDF) pitch extractor [52] is a variation of autocorrelation analysis to determine the pitch period of voiced speech sounds. This method takes advantage of the periodicity of voiced speech. It calculates a difference function between the delayed speech and the original speech that dips sharply at multiples of the pitch period. The AMDF is implemented with subtraction, addition, and absolute value operations, whereas autocorrelation methods use addition and multiplication operations. For this reason, the AMDF is attractive for real-time operations.
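The AMDF idea described above can be illustrated with a minimal sketch. It is not the implementation of [52]; the function names, the lag search range, and the synthetic pulse-train frame are assumptions made only for illustration.

```python
import numpy as np

def amdf(frame, max_lag):
    """Average magnitude difference for lags 1..max_lag (entry 0 is unused)."""
    n = len(frame)
    d = np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        # Only subtraction, absolute value and addition are needed per lag.
        d[lag] = np.mean(np.abs(frame[lag:] - frame[:n - lag]))
    return d

def amdf_pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the deepest AMDF dip within [min_lag, max_lag]."""
    d = amdf(frame, max_lag)
    return min_lag + int(np.argmin(d[min_lag:max_lag + 1]))

# Hypothetical voiced frame: decaying pulses repeated every 50 samples
# (6.25 ms at an assumed 8 kHz sampling rate).
fs = 8000
period = 50
excitation = np.zeros(600)
excitation[::period] = 1.0
frame = np.convolve(excitation, np.exp(-np.arange(40) / 8.0))[:600]

estimated = amdf_pitch_period(frame, min_lag=20, max_lag=90)
period_ms = 1000.0 * estimated / fs
```

The search range is kept below twice the true period here so that the dip at the second multiple of the pitch period cannot be selected; a practical detector needs additional logic for that case.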

Another real-time pitch extraction method, based on linear predictive techniques, is presented by Maksym [53]. The method employs a nonstationary error process from the adaptive predictive coder by Atal [5]. In addition to pitch period extraction, the algorithm also detects voiced speech. The method is based on a predictive one-bit quantizer with an adaptive algorithm for determining the prediction coefficients. Since the method operates on the short-term prediction of the speech waveform, the presence of the glottal excitation can be detected.

A semiautomatic pitch detector (SAPD) [54] has been presented by McGonegal, Rabiner, and Rosenberg. This method semiautomatically determines the pitch contour of an utterance. An autocorrelation of the speech is generated. The cepstrum of the unfiltered speech is computed. These displays are shown on a scope on a frame-by-frame basis. The computed pitch period for each waveform is marked and displayed to the user. With the incorporation of the three waveforms, an extremely accurate measure is found. The processing of an utterance is lengthy; however, the robustness and accuracy of the results can justify this trade-off for many applications.


A recent method for estimating the pitch period of voiced sounds in the presence of noise is based on a maximum likelihood formulation [55]. This scheme is designed to be resistant to white, Gaussian noise. A new signal is formed from the speech signal with a maximizing function to enhance the peaks for short periods. The function is formed by an autocorrelation of the speech. It provides accurate estimates of the pitch period and can be used to determine formant structure. Compared with the cepstrum method, it performs better under white noise conditions.

An automatic pitch extraction method that also performs formant frequency tracking was developed by Markel [56]. This method is similar to cepstral analysis. The technique uses two FFTs to obtain the sequence from which the pitch is extracted. The difference between this method and the cepstral method is the procedure for making the voiced/unvoiced decision.

An accurate method based on the prediction residual is the method by Atal and Hanauer [5]. The speech is low-pass filtered and each sample is raised to the third power to emphasize the high amplitudes of the speech waveform. A pitch-synchronous correlation analysis is performed on the cubed speech. A voiced/unvoiced decision is made in this technique. A second method is based on a linear prediction representation of the speech waveform. Each sample is predicted from the previous n samples, and therefore the correlation is not good at the beginning of the pitch period, where the error is large. The basis of the technique is to use peak picking for the pitch detection.

Another accurate method has been described by Itakura and Saito [57]. This method determines the prediction error signal by the method of lattice filter formulation. The pitch period is determined by computing autocorrelation coefficients of the residual. The autocorrelation is compared with a set threshold to make the voiced/unvoiced decision along with the pitch period.
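A thresholded autocorrelation decision of this general kind can be sketched as follows. This is only an illustration under stated assumptions and not the algorithm of [57]: the normalization, the threshold value of 0.3, the lag range, and the synthetic test signals are all hypothetical choices.

```python
import numpy as np

def autocorr_pitch(residual, min_lag, max_lag, threshold=0.3):
    """Return (voiced, lag) from the normalized autocorrelation of one frame.

    The frame is called voiced when the largest normalized autocorrelation
    value in [min_lag, max_lag] exceeds `threshold` (an assumed value).
    """
    r = np.asarray(residual, dtype=float)
    r = r - np.mean(r)
    r0 = float(np.dot(r, r))
    if r0 == 0.0:
        return False, 0
    lags = np.arange(min_lag, max_lag + 1)
    rho = np.array([np.dot(r[lag:], r[:-lag]) / r0 for lag in lags])
    best = int(np.argmax(rho))
    voiced = bool(rho[best] > threshold)
    return voiced, (int(lags[best]) if voiced else 0)

# Hypothetical residual frames: pitch pulses every 50 samples plus weak
# noise (voiced), and noise alone (unvoiced).
rng = np.random.default_rng(0)
pulses = np.zeros(400)
pulses[::50] = 1.0
voiced_frame = pulses + 0.05 * rng.standard_normal(400)
noise_frame = rng.standard_normal(400)

voiced_result = autocorr_pitch(voiced_frame, 20, 120)
noise_result = autocorr_pitch(noise_frame, 20, 120)
```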

A two-stage method was developed by Boll [58] to determine the pitch period. The method is based on the Itakura [57] algorithm. It adds an initialization of each frame based on the results of the preceding frame. Only the portion of the autocorrelation function of the residual in the range where a pitch pulse is expected, on the basis of the a priori information, is computed in each frame. The savings in computation are significant.

Two  methods  were  developed  by  Barnwell  and  others  [59].  These  algor¬ 
ithms  are:  1)  the  multiband  pitch  period  (MBPP)  estimator,  and  2)  the 
skip-sample  recursive  least  squares  pitch  position  estimator.  The  multi¬ 
band  pitch  period  estimator  first  filters  the  speech  waveform  into  four 
bands  across  the  frequency  regions  where  a  fundamental  is  expected  to 
occur.  The  bandwidths  of  these  filters  are  chosen  so  that  only  one  of 
the  outputs  will  be  expected  to  contain  the  fundamental.  Zero-crossing 
pitch  detectors  operate  on  the  outputs  of  each  of  the  filters.  The  in¬ 
formation  derived  from  the  zero-crossing  detectors  is  used  as  a  basis  for 
logical  operations  to  produce  pitch  period  estimates.  The  skip  sample 
recursive  least  squares  technique  is  based  on  a  recursive  least  squares 
linear  predictive  coder.  The  coder  operates  on  a  lower  sampling  rate 
than  a  linear  predictive  coder  and  it  uses  fewer  coefficients  than  the 
predictive  filter.  This  approach  permits  the  original  sampling  time 
resolution  to  be  retained.  The  method  produces  a  sharp  residual  signal 
whose  pitch  pulses  can  be  used  to  determine  the  period. 

The future trend is towards efficient low-bit-rate coding that enhances the perceptual quality and intelligibility of speech. The coding of the residual signal is one way of arriving at the desired goal. This thesis presents such an idea along with a novel approach to pitch extraction. The next section presents the organization of the thesis.

1.3  Organization  of  the  Thesis 

Chapter II presents the basic ideas associated with the concept of the prediction residual. The mechanism of speech production, as related to speech articulation, is discussed in speech science terms. A model of the vocal tract is presented in mathematical terms and the residual is presented in algorithmic form. The method of short-time analysis is presented. A new method for pitch determination is presented that uses the residual waveform as the source function.

Chapter  III  presents  some  of  the  general  ideas  associated  with  cod¬ 
ing  of  speech  along  with  some  applications.  The  method  of  transform  cod¬ 
ing  (TC)  is  compared  to  the  method  of  sub-band  coding  (SBC).  The  equiva¬ 
lence  of  the  two  methods  is  shown  under  certain  conditions.  The  Articu¬ 
lation  Index  (AI)  and  the  phoneme  transitional  information  related  to 
speech  intelligibility  are  discussed  along  with  their  incorporation  into 
the  coding  scheme  to  enhance  the  perception  of  speech.  The  results  of 
the  distribution  of  energy  from  the  prediction  residual  of  the  phonemes 
are  presented. 

Chapter IV presents the design of the energy-based sub-band coding algorithm. The basic ideas associated with sub-band coding are discussed as related to the proposed coding scheme. The adaptive quantization is presented to explain the allocation of bits. The results of signal-to-noise ratio (SNR) performance measurements are presented. The computation for coding the prediction residual is presented.

Chapter  V  presents  a  summary  and  suggestions  for  further  study.  The 
appendixes  give  a  sample  of  the  related  speech  science  definitions,  com¬ 
puter  programs  for  coding  the  prediction  residual,  a  brief  review  of  the 
concept  of  Articulation  Index  and  sonagrams  of  speech  data. 


CHAPTER  II 


PREDICTION  RESIDUAL  AND  THE  PITCH  EXTRACTION 

2.1  Introduction 

Recent work in the area of speech analysis and synthesis is based upon a model that separates the glottal flow from the vocal tract. That is, speech production is represented by a convolution model in which the input corresponds to the glottal volume velocity and the vocal tract is represented by a filter. Recent models have assumed an all-pole filter to represent the vocal tract [5]. The filter coefficients are determined by using the method of linear prediction. By using the inverse filter, the speech can be deconvolved to obtain the prediction error or residual. The block diagram representing this is shown in Figure 3. The residual produces a peak where the prediction is poor, and these peaks mark the pitch periods. As the prediction becomes more accurate, the residual appears as a noisy signal.

Most synthesis models use a filter excited by either a train of quasi-periodic pulses or a random noise source [60]. The periodic source excites the filter for voiced sounds. The noise source excites the filter for unvoiced sounds. The prediction residual is applicable for voiced or unvoiced sounds because the residual approximates the corresponding input sources that generate these sounds. A detailed description of the prediction residual is given in Section 2.4.

The linear predictive techniques described so far have been used successfully for time-domain speech analysis and synthesis [5] [30]. Linear predictive coding (LPC) techniques have been used in communications in the past; however, they were applied to speech only recently [5] [7]. The use of linear prediction in describing the transfer function of the vocal tract avoids the complexity of Fourier analysis. The slowly time-varying aspects of speech can be taken into consideration by updating the filter coefficients periodically.

Two significant contributions have been made by Wiener [61] [62] and Shannon [63]. Wiener's work describes prediction and filtering of random time series data. Shannon's results describe the information content of a message, related to the bandwidth and time requirements of that message. The background of this chapter uses Wiener's method as applied to stationary data. Shannon's results are implicitly used in the coding scheme.

Section  2.2  describes  the  basis  of  human  speech  production.  Section 
2.3  discusses  the  vocal  tract  model  as  a  discrete  time  invariant  linear 
filter.  Section  2.4  describes  a  parallel  between  the  glottal  waveform 
and  the  residual  signal.  Section  2.5  reviews  linear  prediction  analysis. 
Section  2.6  discusses  short-time  analysis.  Section  2.7  describes  the 
implementation  of  operations  for  the  calculation  of  the  prediction  resid¬ 
ual.  Section  2.8  presents  a  novel  pitch  extraction  technique. 

2.2  Mechanism  of  Speech  Production 

Man's system of communication is speech. Speech is produced through the human vocal system in a continuous fashion. However, speech signals are composed of a sequence of discrete sounds called phonemes. Although phonemes are not bricks, they are the basic sounds that serve to make a complete word in any language. The connection or arrangement of these sounds is based on certain rules. The study of these rules and of the way these sounds fit together is called linguistics. The basic linguistic element is called a phoneme. Its distinguishable variations are called allophones [2].

Speech  in  humans  is  produced  by  a  physical  acoustic  system  consist¬ 
ing  of  principally  four  parts:  lungs,  vocal  tract,  nasal  tract  and  vocal 
cords  (see  Figure  4).  The  lungs  supply  the  volume  of  air  necessary  to 
produce  speech.  The  vocal  tract  and  nasal  tract  act  as  filters  to  shape 
the  waveform.  The  velum,  a  small  flap  of  skin,  acts  as  a  switch  to  close 
the  entrance  to  the  nasal  tract.  When  closed,  it  removes  any  effect  the 
nasal  tract  may  have  on  the  sound  produced.  The  vocal  cords,  tongue, 
teeth  and  palate  are  parts  of  the  filter  or  constriction  mechanism.  An 
elongated  opening  between  the  folds  of  the  skin  which  make  up  the  vocal 
cords  is  called  the  glottis. 

The vocal tract provides the column of air, which is set into vibration by the excitation of the glottis. In an average male, the vocal tract is about 17 centimeters in length. The cross-sectional area, which is determined by the position of the tongue, lips, jaw and velum, varies from zero, i.e., complete closure, to approximately 20 square centimeters.

Speech sounds produced by the system can be separated into three distinct classes according to their mode of excitation. The voiced sounds are produced when air is permitted to escape in quasi-periodic pulses by the vibratory action of the vocal cords. This sets the acoustic system to vibrating at its natural frequencies. These resonant frequencies are concentrations of energy and are known as formant frequencies. These are useful in characterizing the vocal tract configuration, as there is a one-to-one correspondence between vocal tract configuration and formant frequencies. The fricative or unvoiced sounds are generated by forming a constriction at some point along the vocal tract and forcing air through the constriction at a velocity high enough to produce turbulence. This can be identified as wideband noise exciting the vocal tract. For an unvoiced sound the vocal cords are relaxed and partially open. The plosive sounds result from a complete closure of the vocal tract and a sudden or abrupt release of the closure.

AVERAGES OF FUNDAMENTAL AND FORMANT FREQUENCIES AND FORMANT AMPLITUDES OF VOWELS BY 76 SPEAKERS

The formants or natural resonances are numbered F1, F2, F3, .... Typically, for speech analysis, only the first three or four are used. Table I gives representative values of these for certain vowels. It has been noted that all phonemes exhibit some formant structure; however, it is most pronounced for voiced sounds [2]. The first formant is typically higher in frequency than the fundamental frequency. The fundamental frequency is the rate of vibration of the vocal cords, whereas the first formant represents the first concentration of energy of the vocal tract system excited at the fundamental frequency. Typically, the fundamental frequency is around 120 Hertz for men, 220 Hertz for women and 300 Hertz for children. The pitch period is the reciprocal of the fundamental frequency. The pitch period ranges from three milliseconds to eight milliseconds for voiced sounds. The unvoiced sounds have most of their frequency content above 4000 Hz and an approximately flat spectrum. All voiced sounds are characterized by voice onset time (VOT). For example, plosives are characterized by VOT, which is the delay from complete closure of the plosive to the beginning of voicing [66]. The VOT ranges from 25 milliseconds to 300 milliseconds depending on the phoneme.
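As a worked check of the reciprocal relation between fundamental frequency and pitch period, the typical fundamental frequencies quoted above give the following periods (the symbols $F_0$ and $T_0$ are introduced here only for this illustration):

```latex
T_0 = \frac{1}{F_0}: \qquad
\frac{1}{120\ \mathrm{Hz}} \approx 8.3\ \mathrm{ms}, \qquad
\frac{1}{220\ \mathrm{Hz}} \approx 4.5\ \mathrm{ms}, \qquad
\frac{1}{300\ \mathrm{Hz}} \approx 3.3\ \mathrm{ms}
```

These values fall at the ends and in the middle of the three-to-eight millisecond range stated for voiced sounds.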

Each phoneme has its own characterization depending on the language. This characterization is associated with place of articulation and voicing. This thesis discusses the phonemes of the English language. This is not to discard the pitch inflections in Chinese, whispered vowels in Japanese or vocal clicks of South African Hottentots, but to restrict the discussion to an area basic to all languages. This is established by the International Phonetic Association (IPA). Most linguists use about 35 basic units and six diphthongs or combination phonemes. The symbols and teletype representations of these are shown in Table II.

Phoneticians classify speech sounds as vowels and consonants or, strictly speaking, by the manner and place of their production. Each phoneme has certain characteristics and is identified from the distinctive features of the speech sound. The distinctive features give a unique identification of the phoneme. These are given below [68].

1.  Vocalic/Nonvocalic 

presence  vs.  absence  of  a  sharply  defined  formant  structure. 

2.  Consonant/Nonconsonant 
low  vs.  high  total  energy. 

3.  Interrupted/Continuant 

silence  followed  and/or  preceded  by  spread  of  energy  over  a  wide 
frequency  region  (either  as  a  burst  or  a  rapid  transition  of 
vowel  formants)  vs.  absence  of  abrupt  transition  between  sound 
and  the  silence. 


TABLE II (Continued)

Standard IPA    Teletype Representation    Example
g               G                          goat
f               F                          fault
v               V                          vault
θ               TH                         ether
ð               DH                         either
s               S                          sue
z               Z                          zoo
ʃ               SH                         leash
ʒ               ZH                         leisure
h               HH                         how
m               M                          sum
n               N                          sun
ŋ               NX                         sung
l               L                          laugh
w               W                          wear
j               Y                          young
r               R                          rate
tʃ              CH                         char
dʒ              JH                         jar
hw              WH                         where

Source: Rabiner and Schafer, Digital Processing of Speech Signals, New Jersey: Prentice-Hall, 1978, p. 43.

4. Nasal/Oral

spreading the available energy over wider vs. narrower frequency regions by a reduction in the intensity of certain (primarily the first) formants and introduction of additional (nasal) formants.

5.  Tense/Lax 

higher  vs.  lower  total  energy  in  conjunction  with  a  greater  vs. 
smaller  spread  of  the  energy  in  the  spectrum  and  in  time. 

6.  Compact/Diffuse 

higher  vs.  lower  concentration  of  energy  in  a  relatively  narrow, 
central  region  of  the  spectrum  accompanied  by  an  increase  vs.  a 
decrease  of  the  total  energy. 

7.  Grave/Acute 

concentration  of  energy  in  the  lower  vs.  upper  frequencies  of 
the  spectrum. 

8.  Flat/Plain 

flat  phonemes  in  contra-distinction  to  the  corresponding  plain 
ones  are  characterized  by  a  downward  shift  or  weakening  of  some 
of  their  upper  frequency  components. 

9.  Strident/Mellow 

higher  intensity  noise  vs.  lower  intensity  noise. 

A table of the distinctive features of the phonemes of English is shown in Figure 5 [66]. As indicated above, each feature may take one of two values. The presence or absence of each feature is expressed as a plus (+) or minus (-). For example, in the vocalic category vowels are shown as plus and consonants are shown as minus.

2.3  Model  of  the  Vocal  Tract

The  acoustic  speech  system  was  qualitatively  described  in  the  pre¬ 
vious  section.  The  acoustic  tube  model  of  the  vocal  tract  filter  can  be 
represented  as  a  discrete  time-invariant  linear  filter.  The  modeling 
has  been  discussed  in  the  literature  [2]  [7]  [67].  The  acoustic  tube  is 
approximated  by  a  number  of  sections  each  having  a  constant  cross-sec¬ 
tional  area.  The  cross-sectional  area  is  characterized  by  the  reflection 
coefficients.  The  reflection  coefficient  is  the  percentage  of  a  wave  re¬ 
flected  at  an  acoustic  tube  junction.  The  number  of  sections  in  the 
acoustic  tube  model  is  related  to  the  number  of  formants  for  a  phoneme. 

The formants of speech correspond to the poles of the vocal tract transfer function [67]. As pointed out in the last section, only the first three or four formants are used for speech analysis, and these frequencies are below 5000 Hz. Generally, vocal tract resonances occur about one per thousand Hertz [67]. Therefore, a bandwidth of 5 kHz is, in general, sufficient for speech analysis and synthesis. Each phoneme is set apart from the others by the frequency location of the formants.

The majority of phonemes can be represented by an all-pole model of the vocal tract [5]. It is well known that for nonnasal voiced phonemes the transfer function of the vocal tract has no zeros [69]. Nasal and glide sounds include zeros in the transfer function, so both zeros and poles are needed to approximate these sounds exactly. However, it has been shown that the effect of zeros in the vocal tract transfer function can be approximated by including more poles [5].

In Figure 1, let the transfer function of the vocal tract be expressed by [7]

$$V(z) = \frac{G}{1 - \sum_{i=1}^{P} a_i z^{-i}} \tag{2.1}$$

where $G$, the gain, and $\{a_i\}$, the filter coefficients, are functions of the cross-sectional areas of the acoustic tube. The value of $P$, the order of the system, is usually taken as twice the number of formants used in the analysis of each speech sound. Typical values for $P$ range from 8 to 10. The value of 10 has been used for lattice network representations of the vocal tract.
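The all-pole form (2.1) can be explored numerically with a short sketch. The example below is only illustrative: the sampling rate of 10 kHz, the single resonance at 500 Hz, the pole radius of 0.97 and the function name are assumptions, not values from this thesis. One conjugate pole pair (P = 2) produces one spectral peak, in line with taking P as twice the number of formants.

```python
import numpy as np

def allpole_response(a, gain, freqs_hz, fs):
    """|V(e^{jw})| for V(z) = G / (1 - sum_i a_i z^-i), with a = [a_1..a_P]."""
    w = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    i = np.arange(1, len(a) + 1)
    # Denominator 1 - sum_i a_i e^{-j w i}, evaluated at every frequency.
    denom = 1.0 - np.exp(-1j * np.outer(w, i)) @ np.asarray(a, dtype=float)
    return np.abs(gain / denom)

# Hypothetical single resonance: poles at radius * exp(+/- j*theta).
fs = 10000.0
radius, f_res = 0.97, 500.0
theta = 2.0 * np.pi * f_res / fs
# (1 - r e^{j theta} z^-1)(1 - r e^{-j theta} z^-1)
#   = 1 - 2 r cos(theta) z^-1 + r^2 z^-2
a = [2.0 * radius * np.cos(theta), -radius ** 2]

freqs = np.arange(0.0, fs / 2.0, 10.0)
mag = allpole_response(a, 1.0, freqs, fs)
peak_hz = float(freqs[np.argmax(mag)])
```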

It  has  been  shown  that  given  (2.1),  a  lossless  tube  model  can  be 
found  [5]  [7].  Also,  given  an  acoustic  tube  with  all  areas  positive, 
Equation  (2.1)  describes  a  stable  system  [7]. 

2.4  A  Parallel  Between  Glottal  Waveform 
and  the  Residual  Signal 

In  modern  signal  processing  techniques,  it  is  necessary  to  use  as 
much  information  as  can  be  obtained  about  the  structure  of  the  signal. 
This  section  discusses  the  characteristics  of  the  residual  signal,  which 
is  the  output  of  the  linear  prediction  filter.  It  is  the  difference  be¬ 
tween  the  actual  and  predicted  speech  signals. 

The residual signal used in this thesis is obtained by using the autocorrelation method in the LPC algorithm. In doing this, the speech is Hamming windowed, where the window function is

$$w(n) = \begin{cases} 0.54 - 0.46 \cos\left[\dfrac{2\pi n}{N-1}\right], & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{2.2}$$

with $N = 256$. The computational details are discussed in Section 2.6.

The prediction residual is the ideal signal for the excitation function for LPC analysis [28]. It contains the actual information, rather than a pulse train or random noise as in the simplified linear prediction models [10]. The waveform that excites the vocal tract is the glottal waveform, and the residual approximates this.
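The Hamming window of (2.2) with N = 256 can be generated as in the following minimal sketch. The function name is a hypothetical choice; NumPy's built-in `np.hamming` uses the same definition and is used here only as a cross-check.

```python
import numpy as np

def hamming_window(N):
    """w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)) for 0 <= n <= N - 1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

w = hamming_window(256)
matches_numpy = bool(np.allclose(w, np.hamming(256)))

# Windowing a frame is a sample-by-sample product; the frame is placeholder data.
frame = np.ones(256)
windowed = frame * w
```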

The characteristics of the prediction residual are as follows: (1) it marks the pitch period, (2) it has basically a flat amplitude spectrum, (3) phasing information is embedded in the prediction residual, (4) the amplitude spectrum includes details related to the suprasegmentals of the individual and the spoken words, (5) the waveform includes the fact that voiced fricatives and stops are a combination of noise and a repetitive signal.

Figure  6  gives  a  comparison  between  a  speech  wave  and  the  corre¬ 
sponding  prediction  residual  for  a  particular  phoneme.  The  computational 
aspects  in  obtaining  these  figures  will  be  discussed  later.  The  pitch 
period  is  marked  by  large  spikes  in  the  residual  signal.  The  residual 
gives  an  excellent  estimation  of  pitch  since  the  glottal  excitation  is 
clearly  marked. 

Figure 7 displays an unsmoothed spectrum of the residual signal. The spectrum of the residual contains the formants also. The peaks of the formants are flattened; however, there is evidence of the fundamental and formant frequencies on the plot. The dashed line represents a smoothed spectrum. Even in this, it is seen that there is evidence of the fundamental and the formants.

The pitch and voicing for each human are unique. It can be shown by spectrograms that individuals have unique voice prints. This uniqueness is basic to the excitation signal rather than to the vocal tract filter. Therefore, the suprasegmentals, i.e., the intonation, dialect, melody pattern, etc., will remain unique to individuals for voiced sounds.

Figure 6. Speech Waveform and Its Prediction Residual for the Phoneme /æ/ over 256 Sample Interval

Figure 7. Spectrum of Residual Signal for the Phoneme /jp /

The voiced fricative lends more benefit to this discussion than its cognate, the unvoiced fricative. The unvoiced fricative is simply a noisy speech waveform that produces only a noisy residual signal. The fricative or stop is produced by forcing air through a constriction, such as the teeth or lips. The corresponding sound results from the turbulence and is of the noisy type. The waveform is then represented by noise that can be shown to be an unvoiced excitation source. However, the voiced fricative is a result of a constriction in the vocal tract while the vocal cords are vibrating. The residual signal from these phonemes contains a repetitive signal at the pitch period.

The artificial excitation function for voiced sounds results in speech that sounds somewhat unnatural. The use of the prediction residual in coding methods would introduce naturalness in voicing. Ideally, the excitation of the vocal tract filter model should approximate the excitation of the human vocal tract. The prediction residual meets this requirement.


2.5  Review  of  Linear  Prediction  Analysis 

Linear prediction analysis uses a weighted sum of $P$ successive speech samples to predict the next speech sample. The weights are chosen such that the mean-square prediction error is minimized. Let

$$x_n = a_1 x_{n-1} + a_2 x_{n-2} + \cdots + a_P x_{n-P} \tag{2.3}$$

where $x_n$ represents the speech sample sequence and $a_i$ is a set of predictive coefficients. In this application, the method of least squares is used. Assuming a stationary linear system [5] with time-invariant statistics and zero mean, let $x_f(n)$ represent the best estimate, in the least mean-square sense, of $x_n$ using the coefficients $a_i$, $i = 1, \ldots, P$, and let $x_b(n)$ be the best backward prediction of $x_n$ using the coefficients $b_i$, $i = 1, \ldots, P$. Then

$$x_f(n) = \sum_{i=1}^{P} a_i x_{n-i} \tag{2.4a}$$

$$x_b(n-P-1) = \sum_{i=1}^{P} b_i x_{n-i} \tag{2.4b}$$

Let $e_f(n)$ and $e_b(n)$ be the forward and backward prediction errors defined by

$$e_f(n) = x_n - x_f(n) = -\sum_{i=0}^{P} a_i x_{n-i} \tag{2.5a}$$

$$e_b(n-P-1) = x_{n-P-1} - x_b(n-P-1) = -\sum_{i=1}^{P+1} b_i x_{n-i} \tag{2.5b}$$

where it is assumed that $a_0 = -1$ and $b_{P+1} = -1$. Figure 8 gives the implementation of (2.5).
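The forward error of (2.5a) amounts to passing the speech samples through an FIR filter with taps $1, -a_1, \ldots, -a_P$. The sketch below illustrates this under stated assumptions: the function name, the zero initial conditions and the second-order test signal are hypothetical and are not taken from the thesis programs.

```python
import numpy as np

def forward_error(x, a):
    """e_f(n) = x_n - sum_{i=1..P} a_i x_{n-i}, taking x_n = 0 for n < 0."""
    x = np.asarray(x, dtype=float)
    # Inverse (prediction error) filter taps: [1, -a_1, ..., -a_P].
    taps = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, taps)[:len(x)]

# Hypothetical test signal: a single pulse driving an exact second-order
# recursion x_n = a_1 x_{n-1} + a_2 x_{n-2}.
a = [1.2, -0.5]
x = np.zeros(50)
x[0] = 1.0
for n in range(1, 50):
    x[n] = a[0] * x[n - 1] + (a[1] * x[n - 2] if n >= 2 else 0.0)

e = forward_error(x, a)
```

In this idealized case the error is nonzero only at the excitation pulse, which is the behavior that lets the residual mark pitch pulses in voiced speech.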

Since stationarity is assumed, it follows that the errors can be minimized by

$$\frac{\partial}{\partial a_j} E\left[(e_f(n))^2\right] = 0, \qquad j = 1, \ldots, P \tag{2.6a}$$

$$\frac{\partial}{\partial b_j} E\left[(e_b(n-P-1))^2\right] = 0, \qquad j = 1, \ldots, P \tag{2.6b}$$

Figure 8. Implementation for Generation of Forward and Backward Prediction Errors

These reduce to

$$\sum_{i=1}^{P} a_i E[x_{n-i} x_{n-j}] = E[x_n x_{n-j}], \qquad j = 1, \ldots, P \tag{2.7}$$

$$\sum_{i=1}^{P} b_i E[x_{n-i} x_{n-j}] = E[x_{n-P-1} x_{n-j}], \qquad j = 1, \ldots, P \tag{2.8}$$

By defining

$$E[x_{n-i} x_{n-j}] = R_{i-j} \tag{2.9}$$

Equations (2.7) and (2.8) can be expressed by

$$\sum_{i=1}^{P} R_{i-j}\, a_i = R_j, \qquad j = 1, \ldots, P \tag{2.10a}$$

$$\sum_{i=1}^{P} R_{i-j}\, b_i = R_{P+1-j}, \qquad j = 1, \ldots, P \tag{2.10b}$$

where $R_{i-j} = R_{j-i}$ has been used. It is clear that (2.10a) and similarly (2.10b) can be written in a matrix form, wherein the coefficient matrix is a symmetric Toeplitz matrix [71]. Furthermore,

$$b_i = a_{P+1-i}, \qquad i = 1, \ldots, P \tag{2.11}$$


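which can be seen by defining

$$j = P+1-\ell, \qquad i = P+1-k$$

in (2.10a). That is,

$$\sum_{k=1}^{P} R_{k-\ell}\, a_{P+1-k} = R_{P+1-\ell}$$

which can be rewritten as

$$\sum_{i=1}^{P} R_{i-j}\, a_{P+1-i} = R_{P+1-j}, \qquad j = 1, \ldots, P \tag{2.12}$$

Comparing (2.12) with (2.10b), the relation in (2.11) can be seen. The mean-square forward prediction error is

$$\begin{aligned}
E_P &= E\left[\left(x_n - \sum_{i=1}^{P} a_i x_{n-i}\right)\left(x_n - \sum_{j=1}^{P} a_j x_{n-j}\right)\right] \\
&= E\left[x_n^2 - 2\sum_{i=1}^{P} a_i x_n x_{n-i}\right] + E\left[\sum_{j=1}^{P} a_j \sum_{i=1}^{P} a_i x_{n-i} x_{n-j}\right] \\
&= E[x_n^2] - \sum_{i=1}^{P} a_i E[x_n x_{n-i}] \\
&= R_0 - \sum_{i=1}^{P} a_i R_i
\end{aligned} \tag{2.13}$$

where (2.7) has been used to obtain (2.13).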
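The symmetric Toeplitz systems (2.10a) and (2.10b), the relation (2.11) and the error expression (2.13) can be checked numerically with a small sketch. The autocorrelation values below are hypothetical numbers chosen only so that the matrix is positive definite, and a general linear solver is used here in place of any recursive method.

```python
import numpy as np

def toeplitz_from_autocorr(R, P):
    """P x P symmetric Toeplitz matrix with entries R_{|i-j|}."""
    idx = np.arange(P)
    return np.asarray(R, dtype=float)[np.abs(idx[:, None] - idx[None, :])]

# Hypothetical autocorrelation values R_0..R_3 for a predictor of order P = 3.
R = np.array([1.0, 0.8, 0.5, 0.2])
P = 3
T = toeplitz_from_autocorr(R, P)

a = np.linalg.solve(T, R[1:P + 1])    # (2.10a): right side R_j
b = np.linalg.solve(T, R[P:0:-1])     # (2.10b): right side R_{P+1-j}
E_P = R[0] - np.dot(a, R[1:P + 1])    # (2.13)

backward_is_reversed_forward = bool(np.allclose(b, a[::-1]))   # (2.11)
```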

The cross-correlation between the forward and backward prediction errors is derived in the following. Let

$$\begin{aligned}
C_{P+1} &= E[e_f(n)\, e_b(n-P-1)] \\
&= E[x_n x_{n-P-1}] - E\left[\sum_{j=1}^{P} a_{P+1-j}\, x_n x_{n-j}\right] - E\left[\sum_{i=1}^{P} a_i x_{n-i} x_{n-P-1}\right] + E\left[\sum_{j=1}^{P} a_{P+1-j} \sum_{i=1}^{P} a_i x_{n-i} x_{n-j}\right] \\
&= R_{P+1} - \sum_{i=1}^{P} a_i R_{P+1-i}
\end{aligned} \tag{2.14}$$

where again (2.7) has been used to obtain (2.14).

It is clear that (2.13) and (2.14) correspond to $P$ coefficients. In the following, a recursive method will be used wherein the coefficients $a_i$ will be updated. For this reason, let

$$E_0 = R_0 \tag{2.15}$$

$$E_k = R_0 - \sum_{i=1}^{k} a_i^{(k)} R_i, \qquad k = 1, \ldots, P \tag{2.16}$$

$$C_{k+1} = R_{k+1} - \sum_{i=1}^{k} a_i^{(k)} R_{k+1-i}, \qquad k = 0, 1, \ldots, P-1 \tag{2.17}$$

where $a_i^{(k)}$ are determined from (2.10a) by using

$$\sum_{i=1}^{k} R_{i-j}\, a_i^{(k)} = R_j, \qquad j = 1, \ldots, k \tag{2.18}$$

Durbin's method [72] [73] can now be used to solve for $a_i^{(k)}$ in
(2.18). The corresponding equations are

$$E_0 = R_0 \tag{2.19}$$

$$
\begin{aligned}
k_{j+1} &= \frac{C_{j+1}}{E_j} \\
a_{j+1}^{(j+1)} &= k_{j+1} \\
a_i^{(j+1)} &= a_i^{(j)} - k_{j+1}\, a_{j+1-i}^{(j)}, \qquad i = 1, 2, \ldots, j \\
E_{j+1} &= E_j - k_{j+1}\, C_{j+1}, \qquad j = 0, 1, \ldots, P-1
\end{aligned}
\tag{2.20}
$$

The predictive coefficients are obtained from

$$a_i = a_i^{(P)}, \qquad i = 1, 2, \ldots, P$$

that is, from $a_i^{(j+1)}$ at the final step $j = P-1$.
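The recursion (2.19)-(2.20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis program AUTO: the function name `durbin` and the list layout (index 0 of `a` and `k` unused) are hypothetical, and the autocorrelation values are assumed to form a positive-definite Toeplitz matrix so that no $E_j$ is zero.

```python
def durbin(R, P):
    """Solve (2.18) for orders 1..P with the recursion (2.19)-(2.20).

    R holds the autocorrelation values R[0..P]. Returns (a, k, E) where
    a[i] is the order-P predictor coefficient a_i (a[0] is unused),
    k[j] is the PARCOR coefficient k_j (k[0] is unused), and
    E[j] is the residual energy of the order-j predictor.
    """
    a = [0.0] * (P + 1)
    k = [0.0] * (P + 1)
    E = [0.0] * (P + 1)
    E[0] = R[0]                                   # (2.19)
    for j in range(P):
        # C_{j+1} = R_{j+1} - sum_{i=1}^{j} a_i^{(j)} R_{j+1-i}   (2.17)
        C = R[j + 1] - sum(a[i] * R[j + 1 - i] for i in range(1, j + 1))
        k[j + 1] = C / E[j]
        new_a = a[:]
        new_a[j + 1] = k[j + 1]
        for i in range(1, j + 1):
            new_a[i] = a[i] - k[j + 1] * a[j + 1 - i]
        a = new_a
        E[j + 1] = E[j] - k[j + 1] * C
    return a, k, E
```

For example, `durbin([1.0, 0.5, 0.25], 2)` corresponds to a first-order autoregressive signal, so the second PARCOR coefficient vanishes and the residual energy stops decreasing after order 1.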


Interestingly, the prediction residual energy, $E_j$, in (2.20) is readily
available in the algorithm for the predictor of order j. The coefficients
$k_j$ generated in (2.20) are usually referred to as PARCOR coefficients.
These have some interesting characteristics [9] [28].

1.  $|k_j| \le 1$.

2.  Since $|k_j|$ is unity bounded, a set of quantization levels can be
determined.

3.  The PARCOR coefficients are the result of the orthogonalization
of the autocorrelation matrix.

In order to show the application of this system, the transfer function
and the algorithm to acquire the prediction residual are derived
below.

The transforms of $e_f(n)$ and $e_b(n-P-1)$ in (2.5) can be expressed in
terms of

$$E_f(z) = -\sum_{i=0}^{P} a_i z^{-i}\, X(z) \tag{2.21a}$$

$$z^{-(P+1)} E_b(z) = -\sum_{i=1}^{P+1} b_i z^{-i}\, X(z) \tag{2.21b}$$


where $E_f(z)$, $E_b(z)$ and $X(z)$ are the transforms of $e_f(n)$, $e_b(n)$ and $x(n)$,
respectively. Note that $a_0$ and $b_{P+1}$ in (2.21) are each equal to -1. For
simplicity, let

$$A_P(z) = 1 - \sum_{i=1}^{P} a_i z^{-i} \tag{2.22a}$$

$$B_P(z) = z^{-(P+1)} - \sum_{i=1}^{P} b_i z^{-i} \tag{2.22b}$$


With these, (2.21) can be written as

$$E_f(z) = A_P(z)\, X(z) \tag{2.23a}$$

$$E_b(z) = z^{P+1} B_P(z)\, X(z) \tag{2.23b}$$


It is clear that (2.22) was implemented in Figure 8 using the direct
form. Next, the lattice network implementation of (2.22) is discussed
below. In order to do this, recall the relation $b_i = a_{P+1-i}$ given in
(2.11). With the relation, (2.21b) can be written as

$$
\begin{aligned}
B_P(z) &= z^{-(P+1)} - \sum_{i=1}^{P} a_{P+1-i}\, z^{-i} \\
&= z^{-(P+1)} - \sum_{j=1}^{P} a_j\, z^{-(P+1)+j}
\end{aligned}
\tag{2.24}
$$

$$B_P(z) = z^{-(P+1)} A_P(z^{-1}) \tag{2.25}$$

From this it follows that

$$A_P(z) = z^{-(P+1)} B_P(z^{-1}) \tag{2.26}$$

Equations  (2.20),  (2.25)  and  (2.26)  will  now  be  used  to  derive  the 
lattice  implementation.  To  develop  the  recursive  equation  for  the  lattice 
formulation,  some  of  the  above  equations  have  to  be  written  in  a  recursive 
manner.  It  is  clear  that  (2.22)  can  be  rewritten  in  the  form 

$$A_{j+1}(z) = -\sum_{i=0}^{j+1} a_i^{(j+1)} z^{-i}, \qquad j = 0, 1, \ldots, P-1 \tag{2.27a}$$

$$B_{j+1}(z) = -\sum_{i=1}^{j+2} b_i^{(j+1)} z^{-i}, \qquad j = 0, 1, \ldots, P-1 \tag{2.27b}$$

where the superscripts on $a_i$ and $b_i$ are included to denote that a (j+1)th
order predictor is implemented rather than a Pth order one. Also

$$a_0^{(j+1)} = -1 \tag{2.28}$$

$$b_{j+2}^{(j+1)} = -1 \tag{2.29}$$

have been used. The remaining $a_i^{(j+1)}$ can be expressed in terms of $a_i^{(j)}$
using (2.20); the $b_i^{(j+1)}$ are related to $a_i^{(j+1)}$ by [see (2.4)]

$$b_i^{(j+1)} = a_{j+2-i}^{(j+1)} \tag{2.30}$$


Using (2.20), (2.29) and (2.30) in (2.27a)

$$
\begin{aligned}
A_{j+1}(z) &= 1 - \sum_{i=1}^{j} \big(a_i^{(j)} - k_{j+1}\, a_{j+1-i}^{(j)}\big) z^{-i} - k_{j+1}\, z^{-(j+1)} \\
&= A_j(z) - k_{j+1} \Big[z^{-(j+1)} - \sum_{i=1}^{j} a_{j+1-i}^{(j)} z^{-i}\Big] \\
&= A_j(z) - k_{j+1} B_j(z)
\end{aligned}
\tag{2.31}
$$

Using (2.24)

$$B_{j+1}(z) = z^{-(j+2)} A_{j+1}(z^{-1}) \tag{2.32}$$

Equation (2.31) can be rewritten as

$$A_{j+1}(z^{-1}) = A_j(z^{-1}) - k_{j+1} B_j(z^{-1}) \tag{2.33}$$

Substituting (2.32) in (2.33) and simplifying

$$B_{j+1}(z) = z^{-1}\big[B_j(z) - k_{j+1} A_j(z)\big] \tag{2.34}$$

Equations  (2.31)  and  (2.34)  define  the  algorithm.  The  implementa¬ 
tion  of  these  is  shown  in  Figure  9,  where  the  generation  of  is  also 
included.  The  detailed  structure  of  the  optimum  inverse  filter  as  an 
analysis  model  is  shown  in  Figure  10a.  The  corresponding  synthesis  model 
is  shown  in  Figure  10b.  The  output  of  the  synthesis  filter  is  the  input 
speech signal. From the analysis section, the transform of the prediction
residual is $E_f(z) = A_P(z)\, X(z)$.
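In the time domain, (2.31) and (2.34) become a pair of sample-by-sample updates. The sketch below is a hedged illustration of that lattice analysis filter, not the thesis program INVERS: the names `lattice_residual`, `f` and `g` are hypothetical, the starting conditions $A_0(z) = 1$ and $B_0(z) = z^{-1}$ follow from (2.22) and (2.25) with P = 0, and the filter state is assumed to be zero before the first sample.

```python
def lattice_residual(x, k):
    """Forward prediction residual from PARCOR coefficients k[1..P].

    k[0] is unused. f holds the output of A_j(z) applied to x and
    g holds the output of B_j(z) applied to x, for the current order j.
    """
    P = len(k) - 1
    n_samples = len(x)
    f = list(x)                        # A_0(z) = 1
    g = [0.0] + list(x[:-1])           # B_0(z) = z^{-1}, zero initial state
    for j in range(P):
        kj = k[j + 1]
        # (2.31): A_{j+1}(z) = A_j(z) - k_{j+1} B_j(z)
        f_next = [f[n] - kj * g[n] for n in range(n_samples)]
        # (2.34): B_{j+1}(z) = z^{-1} [B_j(z) - k_{j+1} A_j(z)]
        g_next = [0.0] + [g[n] - kj * f[n] for n in range(n_samples - 1)]
        f, g = f_next, g_next
    return f
```

With `k = [0.0, 0.5, 0.2]` the recursion (2.20) gives $a_1 = 0.4$ and $a_2 = 0.2$, so the lattice output should match the direct form $x_n - 0.4\,x_{n-1} - 0.2\,x_{n-2}$.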

2.6  Short-Time  Analysis 

The  concept  of  short-time  Fourier  analysis  [76]  [77]  is  fundamental 
for  coding  the  residual  signal.  For  a  quasi -periodic  signal  such  as 
speech,  the  short-time  or  time-dependent  Fourier  analysis  allows  for  a 
detailed  study. 

The speech signal, x(m), m = 0, 1, ..., L-1, from Equation (2.3) is
segmented into r sections such that short-time spectral analysis can be
used. It is assumed that L = rN, where N corresponds to the number of
samples in each section. This assumes the use of the formula

$$\sum_{n=-\infty}^{\infty} w(nD - m) = 1 \tag{2.35}$$

where w(m) corresponds to a function band limited to a frequency of 1/2D,
and D is the period (in samples) between adjacent samples of the
short-time transform of the signal [77]. In all practical cases, w(m) is a
time limited signal and, therefore, its spectrum cannot be band limited. The
effects of this non-band limited case are discussed in a recent paper
[93]. It has been shown that the aliasing errors are small and can be
neglected if D is properly chosen. For a Hamming window, D = (N/4) [77].
In addition to the aliasing errors, end effect errors have to be considered
also [94]. This is necessary since L is finite. The LPC analysis
is applied to the windowed signal resulting in the windowed residual
signal. The overlap-add of this signal is the residual signal, which has
the errors identified earlier.
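The choice D = N/4 for a Hamming window can be checked numerically against (2.35). The sketch below is an illustration only and rests on assumptions not stated in the text: it uses the periodic form of the Hamming window, $0.54 - 0.46\cos(2\pi m/N)$, for which the shifted windows sum to exactly $4 \times 0.54 = 2.16$, so the window would have to be scaled by 1/2.16 to satisfy (2.35) literally. The function names are hypothetical.

```python
import math


def hamming_periodic(N):
    """Periodic Hamming window of length N (an assumed window form)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / N) for m in range(N)]


def overlapped_window_sum(N, D, L):
    """Sum of all Hamming windows shifted by multiples of D, over 0..L-1."""
    w = hamming_periodic(N)
    total = [0.0] * L
    # Start before zero so that partially overlapping windows also count.
    for start in range(-N + D, L, D):
        for m in range(N):
            idx = start + m
            if 0 <= idx < L:
                total[idx] += w[m]
    return total


window_sum = overlapped_window_sum(N=256, D=64, L=1024)
```

Every entry of `window_sum` is the same constant, which is the property (2.35) asks for up to a scale factor.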

The short-time Fourier transform of the residual signal $e_f(m)$ can be
defined as [67]

$$X_n(e^{j\omega_k}) = \sum_{m=-\infty}^{\infty} \big[e_f(m)\, w(nD - m)\big]\, e^{-j\omega_k m} \tag{2.36}$$

where $\omega_k = 2\pi k/N$, k = 0, 1, ..., N-1, and w(m) corresponds to a window.
For a particular value of n, Equation (2.36) can be implemented using
FFT. This is used in this thesis. A brief review of this is presented
below.

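Let

$$e_n(m) = e_f(nD + m)\, w(-m), \qquad -\infty < m < \infty \tag{2.37}$$

Using this in (2.36), with the summation index shifted by nD,

$$X_n(e^{j\omega_k}) = \Big[\sum_{m=-\infty}^{\infty} e_n(m)\, e^{-j\omega_k m}\Big]\, e^{-j\omega_k nD} \tag{2.38}$$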


Further, let m = Nr+q, $-\infty < r < \infty$, $0 \le q \le N-1$. With these,

$$X_n(e^{j\omega_k}) = \Big[\sum_{r=-\infty}^{\infty} \sum_{q=0}^{N-1} e_n(Nr+q)\, e^{-j\omega_k (Nr+q)}\Big]\, e^{-j\omega_k nD} \tag{2.39}$$

Noting that $e^{-j\omega_k N r} = 1$,

$$X_n(e^{j\omega_k}) = \Big[\sum_{q=0}^{N-1} \Big(\sum_{r=-\infty}^{\infty} e_n(Nr+q)\Big)\, e^{-j\omega_k q}\Big]\, e^{-j\omega_k nD} \tag{2.40}$$

For simplicity, let


$$u_n(q) = \sum_{r=-\infty}^{\infty} e_n(Nr+q), \qquad 0 \le q \le N-1 \tag{2.41}$$

Note that $u_n(q)$ is periodic with period N. Now Equation (2.40) can be
written as

$$X_n(e^{j\omega_k}) = e^{-j\omega_k nD}\, \Big[\sum_{q=0}^{N-1} u_n(q)\, e^{-j\omega_k q}\Big] \tag{2.42}$$

Observe that $X_n(e^{j\omega_k})$ is represented as $e^{-j\omega_k nD}$ times the DFT of the
sequence $u_n(q)$. Therefore, with m = q + nD, (2.42) can be written as

$$X_n(e^{j\omega_k}) = \sum_{m=0}^{N-1} u_n((m - nD))_N\, e^{-j\omega_k m} \tag{2.43}$$

Equation (2.43) represents the DFT form, where $((\cdot))_N$ denotes reduction
modulo N.

The  following  procedure  can  be  used  to  compute  (2.43). 

1.  The windowed sequence, $e_n(m)$, can be computed from (2.37).  The
sequence can then be divided into r sections of N samples each, where in
this thesis, L = 4096, N = 256, D = 64, and r = 16.

2.  The N-point DFT of $u_n((m-nD))_N$ can be computed to obtain (2.43)
using FFT.
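The two steps above can be sketched as follows. This is a hedged illustration, not the thesis program FFTMGR: the function names are hypothetical, the window is passed as a finite list `w[0..Lw-1]` that is taken to be zero elsewhere, samples outside the data record are treated as zero, and a plain DFT sum stands in for the FFT so that the example needs only the standard library. A direct evaluation of (2.36) is included for comparison.

```python
import cmath
import math


def stft_direct(e, w, n, D, N):
    """Evaluate (2.36) directly for frame n at the N frequencies 2*pi*k/N."""
    X = []
    for k in range(N):
        wk = 2.0 * math.pi * k / N
        acc = 0j
        for i, wi in enumerate(w):          # i = nD - m, so m = nD - i
            m = n * D - i
            if 0 <= m < len(e):
                acc += e[m] * wi * cmath.exp(-1j * wk * m)
        X.append(acc)
    return X


def stft_aliased(e, w, n, D, N):
    """Evaluate the same transform through u_n(q) and an N-point DFT."""
    u = [0.0] * N
    for i, wi in enumerate(w):
        m = n * D - i
        if 0 <= m < len(e):
            # e_n(-i) = e(nD - i) w(i); fold index -i into 0..N-1 as in (2.41)
            u[(-i) % N] += e[m] * wi
    X = []
    for k in range(N):
        wk = 2.0 * math.pi * k / N
        dft = sum(u[q] * cmath.exp(-1j * wk * q) for q in range(N))
        X.append(cmath.exp(-1j * wk * n * D) * dft)   # phase factor of (2.42)
    return X
```

The window in the usage test is deliberately longer than N so that the folding in (2.41) is actually exercised; with a 256-point window and N = 256, as in the thesis, each $u_n(q)$ receives a single term.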

The  above  procedure  is  given  here  for  generality.  Due  to  the  limi¬ 
tation  of  the  disc  space  and  to  reduce  computational  time,  a  slightly 
different  procedure  is  used  in  computing  the  spectral  analysis.  The 
residual signal is rectangular-windowed to 256 points, spectrum analyzed,
and then averaged. This is used only for phonemes discussed in the
thesis. No overlapping was used. The errors associated with this method
are  quantified  in  previously  mentioned  references  [93]  [94]. 

2.7  Implementation  of  Operations  for  the 
Calculation  of  the  Prediction  Residual 

In  this  section,  the  formulation  of  the  prediction  residual  from  the 
speech  input  is  presented.  The  implementation  of  the  operations  to  cal¬ 
culate  the  prediction  residual  represents  the  analysis  model  for  LPC.  The 
analysis  model  consists  of  the  speech  as  the  input,  the  vocal  tract  model, 
the  correlation  coefficients  and  the  residual  as  the  output. 

The analog speech signal is band limited to 3600 Hertz using a
second-order Butterworth filter. This signal is digitized at the rate of 8000
samples/second.  The  algorithm  for  digitization  is  named  DIGITIZ  and  the 
computer  program  is  included  in  Appendix  B. 

The results in the last section are used to obtain the windowed digitized
data. This allows the speech to be processed in short segments. The
underlying  assumption  for  most  speech  processing  schemes  is  that  the  prop¬ 
erties  of  the  speech  signal  change  relatively  slowly  with  time  [67].  This 
assumption  leads  to  short-time  methods  which  isolate  the  signal  during  the 
segment  of  windowing.  The  window  is  a  256-point  Hamming  window  and  is 
overlapped  at  64-point  intervals.  The  windowing  is  computed  by  program 
WINDOW  in  Appendix  B. 

The  windowed  signal  is  passed  to  program  AUTO  [7].  This  program  uses 
the  autocorrelation  method  for  solving  the  matrix  equation  (2.10)  for  the 
predictor coefficients [61]. The other matrix values solved for are the
reflection coefficients or PARCOR coefficients. These values are passed
for  use  in  the  lattice  formulation. 
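The front end of this chain (windowing followed by the autocorrelation values that feed (2.10)) can be sketched as below. The code is a hedged illustration rather than the WINDOW and AUTO programs of Appendix B: the function names are hypothetical, the symmetric Hamming form $0.54 - 0.46\cos(2\pi m/(N-1))$ is an assumption, and only frames that fit entirely inside the record are produced.

```python
import math


def hamming(N):
    """Symmetric Hamming window of length N (assumed window form)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]


def windowed_frames(x, N=256, D=64):
    """Split x into Hamming-windowed frames of N samples taken every D samples."""
    w = hamming(N)
    return [[x[s + m] * w[m] for m in range(N)]
            for s in range(0, len(x) - N + 1, D)]


def autocorrelation(frame, P):
    """Short-time autocorrelation values R[0..P] of one windowed frame."""
    n_samples = len(frame)
    return [sum(frame[n] * frame[n + m] for n in range(n_samples - m))
            for m in range(P + 1)]
```

For a 4096-sample record with N = 256 and D = 64 this yields 61 overlapping frames, and `autocorrelation(frame, P)` supplies the values $R_0, \ldots, R_P$ needed by the recursive solution of (2.10a).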

The  lattice  method  represents  a  recursive  algorithm  for  a  solution 
of  the  prediction  residual.  This  method  guarantees  stability.  Note  that 
the  PARCOR  coefficients  are  bounded.  The  program  to  calculate  the  resid¬ 
ual  by  the  lattice  formulation  is  INVERS  and  is  included  in  Appendix  B. 

Figure 11 illustrates a block diagram showing the sequence of operations
related to the calculation of the prediction residual, $e_f(n)$.

2.8  A  Novel  Approach  to  Pitch  Extraction 

2.8.1  Types  of  Problems  Associated  with  Pitch
Extraction

The  pitch  extractor  is  of  prime  importance  in  most  speech  processing 
systems,  as  the  pitch  is  one  of  the  basic  parameters  in  speech  analysis 
and  synthesis  studies.  In  low-bit  rate  systems,  it  is  an  essential  com¬ 
ponent  [2]  [7].  Speech  with  a  constant  fundamental  frequency  is  perceived 
as  a  monotone  or  of  a  synthetic  nature;  variable  pitch  lends  to  speech  a 
melody.  An  accurate  pitch  extractor  is  a  challenging  area  of  speech  pro¬ 
cessing. 

The difficulty in accurately determining pitch is due primarily to
the time-varying aspects of the glottal excitation. Since the model of
the vocal tract assumes quasi-periodic changes occurring along the acoustic
tube, the glottal response is not predicted accurately. This inaccuracy
is due to the nonuniform train of periodic pulses that occur with
the glottal waveform. The simple model of the vocal system excitation,
i.e., periodic uniform pulses or Gaussian noise, eases the measurement of
the period of the pitch. However, when the pitch and the waveform are
changing within a period, which occurs with frequency shifts, difficulty
arises.

The  second  problem  associated  with  the  measurement  of  pitch  is  due 
to  the  nonseparability  of  the  vocal  tract  model  from  the  glottal  excita¬ 
tion.  That  is,  the  separation  of  the  formants  and  the  fundamental  fre¬ 
quency  may  not  be  possible  and  therefore  the  detection  of  the  pitch 
period  is  difficult.  This  interaction  can  be  seen  most  often  during 
transitional  regions  of  formants  when  the  articulatory  elements  are 
changing. 

The  third  problem  is  the  detection  of  the  beginning  and  ending  of 
the  pitch  period.  Part  of  this  problem  occurs  in  the  definition  of  be¬ 
ginning  and  ending  of  the  pitch  period.  In  examining  the  speech  wave¬ 
form,  it  is  necessary  to  always  be  consistent  with  the  method  because 
different  definitions  will  often  lead  to  different  results.  This  is  seen 
in  Figure  12.  In  Figure  12,  one  can  detect  the  period  of  zero  crossings 
before  the  maximum  peaks  or  detect  the  period  between  the  maximums.  How¬ 
ever,  the  two  methods  do  not  always  give  the  same  answers.  The  discrep¬ 
ancy  between  the  two  is  due  to  the  slowly  time-varying  properties  of 
glottal  excitation. 

The  fourth  problem  that  arises  is  the  decision  to  ascertain  which 
segment  of  speech  is  voiced  or  unvoiced.  In  particular,  some  algorithms 
have  problems  distinguishing  between  low-level  speech  and  unvoiced  speech. 
In  transitional  analysis,  it  is  difficult  to  pinpoint  the  difference 
between  the  two. 

In addition to the above problems, pitch detection is hindered
further when the signal is a transmitted speech signal. During the
transmission of a speech signal over a telephone line, degradations
occur that can change the signal and make pitch detection difficult.
These include: 1) phase distortion, 2) amplitude modulation of
the signal, 3) crosstalk between messages, 4) clipping of high-level
sounds. Furthermore, as the signal travels through the telephone lines,
the lines act as a bandpass filter with approximate band edges $f_1$ = 200
Hertz and $f_2$ = 3200 Hertz. The fundamental frequency is usually less
than 200 Hertz and therefore is removed by the bandpass action of the
line. The pitch must be regenerated by using harmonics.

The  next  section  discusses  advantages  and  disadvantages  associated 
with  the  use  of  the  prediction  residual  for  pitch  extraction. 

2.8.2  Advantages  and  Disadvantages  for  Using 
the  Prediction  Residual  as  a  Source  for  Pitch 
Extraction 

The prediction residual solves the problem of vocal tract excitation.
It was stated earlier that there is inaccuracy in determining the glottal
response when using the simple model for excitation. When using the
two-source model for the vocal system excitation, i.e., quasi-periodic pulses
and random noise, a simple algorithm can be used for extraction of pitch.
The residual can be used as a single source as an approximation to the
glottal excitation, and, therefore, a simple method can be used to employ
the residual to extract pitch.

It is well known that the residual represents the deconvolution of
the speech from the vocal tract [7]. For each vocal tract configuration,
a different set of formants and a variation in harmonics of the fundamental
frequency in the spectrum is acquired. The pitch markings are
determined by residual spikes in the time domain. This can be used to
extract the pitch accurately.

An accurate estimate of pitch aids the
perception of speech. Any enhancement to perception is important to any
analysis-synthesis  speech  system.  The  discussion  which  follows  includes 
other  reasons  for  using  the  prediction  residual  as  a  source  for  pitch 
extraction. 

Referring  to  Figure  12,  it  is  shown  where  errors  can  occur  when  the 
speech  signal  is  used  for  pitch  extraction.  Figure  13  shows  the  residual, 
over  256  samples,  characterized  by  spikes  which  represent  the  pitch  per¬ 
iod.  It  can  be  seen  that  it  is  not  necessary  to  account  for  the  zero 
crossings  or  maximums.  It  is  simply  a  matter  of  tracking  absolute  maxi- 
mums  within  the  range  of  the  established  pitch  period.  It  has  been  shown 
that  if  the  interval  of  analysis  is  small  enough  the  residual  can  be  used 
to extract pitch accurately [28]. Future transmission rates will require
a system that can extract pitch with acceptable performance.

An  application  for  using  the  residual  signal  for  pitch  extraction  is 
with  embedded  coding.  The  advantage  with  the  residual  signal  is  that  an 
absolute  pitch  can  be  determined  in  a  frame.  At  higher  transmission 
rates,  the  coding  of  the  residual  can  be  accomplished  more  efficiently. 
Therefore, a pitch extraction method can be employed easily. However,
it is not feasible to transmit the residual at low transmission rates;
consequently,  the  higher  rates  must  extract  the  pitch  and  transfer  this 
to  the  lower  rates.  Since  the  residual  demonstrates  a  very  accurate 
representation  of  the  pitch,  the  frame-by-frame  analysis  of  the  pitch 
from  the  prediction  residual  would  enhance  pitch  in  an  embedded  coding 
scheme. 


However,  a  disadvantage  associated  with  the  residual  may  occur  in  a 
high  noise  environment.  It  has  been  shown  earlier  that  the  residual  is  a 
combination  of  periodic  and  noisy  signals.  In  a  high  noise  environment, 
the  noise  may  overcome  the  residual  signal.  If  the  noise  has  amplitude 
in  the  range  of  pitch  markings,  the  signal  would  require  enhancement  to 
extract  pitch  adequately.  On  the  other  hand,  low  noise  contributes  to 
the  flatness  of  the  spectrum  of  the  signal  and  enhances  pitch  extraction. 

Several advantages and disadvantages have been discussed. It can be
readily seen that the residual is an ideal signal for extraction of the
pitch. The next section discusses the implementation of the pitch
extractor.

2.8.3  A  Novel  Pitch  Extractor 

The last few sections have described the prediction residual as the
result from the linear prediction analysis. It has been shown that the
prediction residual contains much information needed for extracting pitch.
It is a simple problem to pick appropriate peaks to extract the pitch.
It is this problem of pitch extraction that has interested many authors
recently.

Examining Figure 13, a repetitive waveform is seen at the period
called the pitch period. Note that the waveform has a noisiness which
implies a flat amplitude spectrum. It should be noted that for voiced
sounds there are other peaks that are also repetitive. These are evidence
of the formant frequencies. They are somewhat dampened; however,
this is to be expected since the linear predictive filter has the
characteristic of spectrally flattening the signal.

It is well known that for

$$x(t) = A \cos(2\pi f_0 t) \tag{2.44}$$

where A is the constant maximum amplitude of the signal, and $f_0$ is the
fundamental frequency, the spectrum is

$$X(f) = \frac{A}{2}\,\delta(f - f_0) + \frac{A}{2}\,\delta(f + f_0) \tag{2.45}$$

It can be said that the speech waveform is a combination of sinusoids of
the type given in (2.44) summed together in a quasi-periodic fashion.

The residual signal, $e_f(n)$, can be described in a similar fashion.
Therefore, it follows that the spectrum of $e_f(n)$ has impulses at the
fundamental and its harmonics, identified here by $f_0, f_1, \ldots, f_n$. The
maximum amplitude is centered at $f_0$, the fundamental frequency [75]. The
higher frequencies are all harmonics, or multiples, of $f_0$. An a priori
estimate of $f_0$ for a speech sound can be found using the residual as
input. If the spectrum is available, then the frequency of the maximum
amplitude determines an estimate of $f_0$. This estimate is found to be
relatively accurate for speech and the prediction residual. In the
following, a procedure for extracting the fundamental is given.
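The a priori estimate described above (the frequency of the largest spectral amplitude) can be sketched as follows. This is an illustration under assumptions, not the FFTMGR program: the function name is hypothetical, a plain DFT sum replaces the FFT, and the search is limited to an assumed pitch range of 50 to 400 Hertz (the 400 Hertz upper limit appears later in the text; the lower limit is a guess).

```python
import cmath
import math


def estimate_f0_from_spectrum(frame, fs=8000.0, f_lo=50.0, f_hi=400.0):
    """Return the frequency (Hertz) of the largest DFT magnitude in [f_lo, f_hi].

    Returns None when no DFT bin falls inside the search range.
    """
    N = len(frame)
    best_k = None
    best_mag = -1.0
    for k in range(1, N // 2):
        f = k * fs / N
        if f < f_lo or f > f_hi:
            continue
        X = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        if abs(X) > best_mag:
            best_mag = abs(X)
            best_k = k
    return None if best_k is None else best_k * fs / N
```

With 256-point frames at 8000 samples/second the frequency resolution of this estimate is 8000/256 = 31.25 Hertz, which is one reason a time-domain refinement such as the peak picking described next is useful.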

The initial step is to square the residual. This has a dual benefit
in addition to making all quantities positive. First, it makes large
quantities larger and second, any small or noise-like quantities are made
smaller. The new data corresponding to the set of squared samples are
placed in frames of 256 samples each.

Following the initial step, the original sample rate is used to
determine the time difference between maximums. It is assumed that the
maximums mark the beginning of a new pitch period as set by a threshold.
The threshold is used to select the next maximum. The data set is passed
through a peak picker. The peak picker uses the threshold to determine
the next peak (maximum). The time between the two peaks is calculated by
a differencer function. The system is ready to set a pitch value from
the time between the two peaks.

At this point, the a priori estimate of the pitch and the pitch
value from the differencer are averaged. An error check is made for
erroneous  pitch  values.  The  error  check  compares  a  range  of  pitch  from 
a  low  to  a  high  value.  Should  the  averaged  value  be  less  or  more  than  a 
set  threshold,  an  update  is  sent  to  recalculate  the  last  averaged  pitch 
in  the  frame.  The  process  is  continued  until  the  end  of  the  frame  where 
the  pitch  is  set.  The  procedure  for  estimating  pitch  is  shown  in  Figure 
14.  The  next  section  discusses  the  results  in  using  the  pitch  extractor. 
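The procedure of Figure 14 can be sketched in code. The sketch is a loose illustration, not the PITCH program: the text does not give the threshold rule or the update rule, so the relative threshold of one-half of the frame maximum, the minimum peak spacing, the 50 to 400 Hertz acceptance range, and the choice to discard out-of-range values and keep the last accepted one are all assumptions, and the function name is hypothetical.

```python
def extract_pitch(residual, f0_prior, fs=8000.0, rel_threshold=0.5,
                  f_lo=50.0, f_hi=400.0):
    """Estimate the pitch (Hertz) of one frame of the prediction residual.

    Returns None when fewer than two usable peaks are found.
    """
    squared = [v * v for v in residual]              # step 1: square the residual
    peak_value = max(squared) if squared else 0.0
    if peak_value == 0.0:
        return None
    threshold = rel_threshold * peak_value           # assumed threshold rule
    min_gap = int(fs / f_hi)                         # shortest allowed period, samples

    # Peak picker: local maxima above the threshold, at least min_gap apart.
    peaks = []
    for n in range(1, len(squared) - 1):
        is_local_max = squared[n] >= squared[n - 1] and squared[n] > squared[n + 1]
        if is_local_max and squared[n] >= threshold:
            if not peaks or n - peaks[-1] >= min_gap:
                peaks.append(n)

    pitch = None
    for first, second in zip(peaks, peaks[1:]):
        f_time = fs / (second - first)               # differencer
        f_avg = 0.5 * (f_time + f0_prior)            # average with a priori estimate
        if f_lo <= f_avg <= f_hi:                    # error check on the range
            pitch = f_avg
    return pitch
```

As a usage example, a frame containing unit spikes every 64 samples together with an a priori estimate of 125 Hertz returns 125 Hertz, since 8000/64 = 125.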

2.8.4  Pitch  Extraction  Results 

The  PITCH  program  was  applied  to  39  phonemes,  including  16  vowels 
and  diphthongs  and  23  consonants.  Each  sound  was  held  from  one-quarter 
second  to  one  second  by  a  male  speaker  at  normal  intensity.  Recordings 
were  made  on  a  SONY  Model  TC-106A  tape  recorder  under  anechoic  conditions. 
The sounds were low-pass filtered by a Butterworth filter with a cutoff
frequency of 3600 Hertz and sampled at 8000 Hertz with nine quantization
bits and one sign bit. The computer system quantization level setting
was  ±10  volts.  This  gave  a  quantization  level  of  20  millivolts.  The 
digitized  sound  was  sampled  for  1.5  seconds  using  12000  data  points  for 
storage.  With  a  limited  computer  system  memory,  the  beginning  of  the 
sound  was  found  and  4096  points  were  saved.  The  sound  was  stored  for 
later use and labeled with an appropriate name. Due to the processing
of the INTERDATA 70 computer system, all processing in program PITCH is
done in 256-point blocks.

Figure 14. Block Diagram of Pitch Extraction Method
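The recording figures quoted above are mutually consistent, as the short calculation below illustrates. It assumes only what the text states: a ±10 volt range (20 volts full scale), ten bits in total (nine quantization bits plus one sign bit), an 8000 Hertz sampling rate, and a 1.5 second capture; the function name is hypothetical.

```python
def quantization_step(full_scale_volts, total_bits):
    """Size of one quantization level for a uniform quantizer."""
    return full_scale_volts / (2 ** total_bits)


step_volts = quantization_step(20.0, 10)      # 20 V / 1024 levels
captured_samples = int(1.5 * 8000)            # 1.5 seconds at 8000 samples/second
blocks_per_sound = 4096 // 256                # 256-point blocks in the saved data
```

The step of 0.01953 volt rounds to the 20 millivolts quoted in the text, the capture length comes to 12000 samples, and the 4096 saved points divide into 16 blocks of 256 points.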


Equation (2.10a) is solved recursively by program AUTO [7]. This
efficient recursive solution was discussed in an
earlier section. The program INVERS processes the data. The residual
data  is  stored  for  use  in  the  program  FFTMGR.  The  computation  of  the 
spectrum  is  performed  in  this  program.  Spectral  values  are  stored  for 
use  in  PITCH. 

Samples  can  be  plotted  for  any  segment  of  the  sound  to  aid  in  visual 
determination  of  the  pitch  period.  An  example  is  shown  in  Figure  13. 

The  results  show  that  the  method  presented  here  is  an  adequate  and  accur¬ 
ate  method  for  determining  pitch  period.  It  is  compared  to  Peterson  and 
Barney's  data  [111].  Table  III  gives  a  comparison  between  this  data  and 
the  results  obtained  from  this  method.  From  this  table,  it  can  be  seen 
that  the  results  are  good.  The  voiced/unvoiced  decision  is  not  a  product 
of  PITCH.  The  FFTMGR  routine  produces  an  energy  level  for  the  determina¬ 
tion  of  voicing.  Voicing  errors  were  made  25  percent  of  the  time.  This 
is  due  to  the  fact  that  the  threshold  is  set  to  a  low  level.  However, 
the  error  check  will  restrict  any  wide  variance  of  pitch.  If  the  calcu¬ 
lated  fundamental  frequency  is  larger  than  a  set  threshold  value  of  400 
Hertz,  then  the  corresponding  sound  is  considered  as  an  unvoiced  sound 
and  no  further  calculations  are  made.  These  two  checks  allow  for  accur¬ 
ate  measure  of  pitch  and  voiced/unvoiced  decision. 

Error-free  pitch  estimation  is  critical  to  the  overall  performance 
of  any  low-bit  rate  coding  system.  Coding  systems  that  incorporate  the 
residual  signal  for  estimation  of  the  pitch  are  accurate  and  adequate. 
Accuracy can be enhanced by using the residual in a minimum noise
environment. The residual signal formulates a true pitch per frame that
can be used at low, synthetic transmission rates. Several reasons
have been discussed to show that this is a novel approach to
pitch extraction; they are summarized below. First, it is a two-stage method
that  estimates  the  residual  spectrum  and  uses  time  samples  of  the  residual 
to  calculate  the  approximation  of  the  pitch.  Second,  the  calculation  is 
done  by  a  thresholding  technique  which  uses  the  square  of  samples.  Fi¬ 
nally,  the  extraction  of  the  pitch  includes  an  error  check  that  estimates 
wide  variances  of  the  pitch  during  each  frame.  From  these,  it  can  be 
seen  that  this  method  can  be  considered  as  a  hybrid  technique. 


TABLE  III

COMPARISON  OF  FUNDAMENTAL  FREQUENCIES

                    Fundamental Frequency (Male), Hertz
Phoneme    from Peterson-Barney [111]    from Proposed Method of Pitch Extraction
/i/        136                           129
/I/        135                           130
/ε/        130                           125
/æ/        127                           135
/ɑ/        124                           123
/ɔ/        129                           135
/U/        137                           126
/u/        141                           151
/ʌ/        130                           123
/ɝ/        133                           140

2.9  Summary 


In this chapter, the characteristics of the prediction residual were
presented. There is a parallel between the glottal waveform and the
residual. The mechanism of speech production and a model of the vocal
tract were discussed. Short-time spectral analysis was presented, and
linear prediction analysis was reviewed. The implementation of operations
for the calculation of the prediction residual was described. A novel
approach to pitch extraction was presented.


CHAPTER  III 


SUB-BAND  CODING  OF  THE  PREDICTION  RESIDUAL 

3.1  Introduction 

The  average  rate  that  speech  is  conveyed  between  humans  is  about  ten 
phonemes  per  second.  It  has  been  shown  that  the  information  rate  of 
speech  does  not  exceed  60  bits/second  [2]  [67].  For  the  information  con¬ 
tent  to  be  preserved,  the  human  must  be  able  to  extract  the  representation 
of  the  speech  signal  at  this  rate.  It  is  important  that  the  speech  is 
intelligible  to  the  listener,  and  this  aspect  is  the  fundamental  consid¬ 
eration  of  coding  speech. 

There  are  two  concerns  in  coding  speech  signals.  First,  the  message 
content of the speech must be preserved. The content includes linguistic
rules to form thoughts for humans to communicate. Second, the speech
signal  should  be  represented  so  that  it  can  be  transmitted.  At  the  re¬ 
ceiver,  the  signal  should  contain  the  message  without  serious  degrada¬ 
tions. 

The  interest  in  speech  coding  has  led  researchers  to  consider  tech¬ 
niques  that  enhance  signal  quality,  reduce  transmission  rate  and  cost, 
without  considering  the  complexity  of  the  coding  algorithm  [64].  The 
principle  is  to  enhance  the  perceptual  aspects  of  speech  through  the 
coding  method.  In  this  chapter,  some  basic  ideas  associated  with  speech 
coding are discussed. These include transform coding and sub-band coding.
Since speech sounds are characteristically different from most acoustic
sounds, it is necessary to consider the properties that include the
formants  and  energy  of  phonemes.  Perceptual  aspects  that  contribute 
transitional  cues  for  humans  to  discriminate  and  differentiate  speech 
sounds  are  discussed  in  this  chapter.  It  is  known  that  when  human  lis¬ 
teners  are  exposed  to  speech,  available  to  them  are  a  set  of  responses 
that  are  highly  over-learned  [65].  The  minimum  discrimination  necessary 
for  absolute  differentiation  of  speech  sounds  is  discussed.  Recently, 
speech  coding  techniques  have  contributed  efficient  methods  to  enhance  the 
coding  of  speech  signals  with  few  degradations.  This  chapter  discusses 
some  of  these  methods  in  Section  3.2.  Section  3.3  presents  a  discussion 
of  the  transform  coding.  Section  3.4  presents  the  method  of  sub-band 
coding  in  detail.  Section  3.5  discusses  the  determination  of  frequency 
bands  by  the  Articulation  Index.  Section  3.6  presents  aspects  associated 
with transitional cueing information for the perception of certain
phonemes. Section 3.7 discusses perception of intelligible speech. Section
3.8  describes  the  basis  for  coding  the  prediction  residual  at  the  rate  of 
9600  bits/second. 


3.2  Coding  Methods 

The oldest form of speech coding device is the channel vocoder invented
by Dudley [78]. Each of the channels has center frequency $\omega_k$. For
each of the channels, the time-dependent Fourier transform is represented
as a cosine wave with center frequency $\omega_k$ which is amplitude and phase
modulated corresponding to the magnitude and phase angle, respectively, of
each transform. Therefore, each channel is thought of as a bandpass
filter with center frequency $\omega_k$ and impulse response w(n). This is shown
in Figure 15.

The  analysis  section  consists  of  a  bank  of  channels  as  in  Figure  15 
with  frequencies  distributed  across  the  speech  band.  Figure  16  shows  a 
complete  channel  vocoder  analyzer. 

The basic diagram for the synthesizer is somewhat different. Each channel
signal controls the amplitude of the contribution of its channel, while
the excitation signals control the detailed structure of the output of a
given channel. The voiced/unvoiced decision serves to select the
appropriate excitation generator, i.e., a random noise generator for
unvoiced speech and a pulse generator for voiced speech. A block diagram
of the synthesizer is shown in Figure 17. Channel vocoders operate in the
range of 1200 bits/second to 9600 bits/second. They are also referred to
as source coders and produce speech of a synthetic nature at bit rates
below 4800 bits/second.

A  major  contribution  of  a  channel  vocoder  is  the  reduction  in  bit 
rate;  however,  direct  representation  of  the  pitch  and  voicing  information 
is  not  achieved.  Therefore,  this  can  be  considered  as  a  weakness. 

The  LPC  vocoder  is  a  very  important  application  of  linear  predictive 
analysis  in  the  area  of  low  bit  rate  encoding  of  speech.  It  is  shown  in 
Figure  18. 

[Figure: Block Diagram of Channel Vocoder Synthesizer]

The basic LPC analysis parameters consist of a set of P predictor
coefficients, the pitch period, a voiced/unvoiced parameter, and a gain
parameter. The vocoder consists of a transmitter, which performs the LPC
analysis and pitch detection. These parameters are coded and transmitted.
They are then decoded and used to synthesize the output speech. This
category of coding is of the vocoder type. Various aspects of it are
discussed in the following.
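
The LPC analysis step described above can be illustrated with a short
sketch. It is not the implementation used in this study: it assumes the
autocorrelation method solved by the Levinson-Durbin recursion, and the
function names, the synthetic second-order test signal, and the gain
estimate are all hypothetical choices made for illustration. Pitch and
voicing detection are not shown.

```python
import numpy as np

def lpc_analysis(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, err). a[0] = 1 and a[1:] are the inverse-filter
    coefficients, so the P predictor coefficients are -a[1:].
    err is the energy of the prediction residual for this frame.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Short-time autocorrelation for lags 0..order.
    r = np.array([np.dot(frame[:n - lag], frame[lag:])
                  for lag in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def prediction_residual(frame, a):
    """Pass the frame through the inverse filter to obtain the residual."""
    return np.convolve(frame, a)[:len(frame)]

# Synthetic test signal (an assumption, not speech data):
# s(t) = 1.3 s(t-1) - 0.6 s(t-2) + excitation(t)
rng = np.random.default_rng(0)
excitation = rng.standard_normal(20000)
signal = np.zeros_like(excitation)
for t in range(len(signal)):
    signal[t] = excitation[t]
    if t >= 1:
        signal[t] += 1.3 * signal[t - 1]
    if t >= 2:
        signal[t] -= 0.6 * signal[t - 2]

a, err = lpc_analysis(signal, order=2)
residual = prediction_residual(signal, a)
gain = np.sqrt(err / len(signal))            # rms level of the residual
```

With this test signal the estimated inverse filter should be close to
[1, -1.3, 0.6], and the residual should resemble the white excitation
that drove the model.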

Speech coding can be divided into two broad categories, waveform coders
and vocoders [79]. A few types of vocoders have been mentioned earlier.
Waveform coders generally attempt to reproduce the original speech
waveform according to some fidelity criterion. Vocoders, on the other
hand, model the input speech according to some speech production model
and then synthesize the speech from the model. The basic make-up for
coding the prediction residual in this thesis is that of a vocoder model.
However, techniques of waveform coding are also used. It has been shown
that waveform coders tend to give better quality speech that is robust,
whereas vocoders tend to sound more synthetic [64] [79]. By borrowing
from the techniques of efficient waveform coders, it is conceivable to
define an acceptable coding algorithm that meets quality standards at low
bit rates of transmission. A primary interest has been to produce the
transmitted speech with the minimum bit rate and still meet acceptable
quality [80]. Methods available to date for coding the residual were
mentioned previously. Efficient methods that improve on these techniques
for coding the prediction residual are presented here.

It has been recognized that there are two efficient methods of waveform
coding [79]. These are: (1) transform coding (TC) [81] and (2) sub-band
coding (SBC) [36] [37]. These are characterized as frequency-domain
coders, whereas PCM, differential PCM, and delta modulation (DM) are
examples of time-domain coders. Frequency-domain coders are perceptually
better than time-domain coders because they tend to exploit the pitch of
the speech waveform for bit rates below 16000 bits/second. They tend to
look at the spectrum of speech in blocks, whereas the predictive systems
look at adjacent samples. These two methods are explained in detail in
the next two sections.


3.3  Transform  Coding 

With transform coding (TC) [81], the sequence of speech samples is
grouped into blocks, where each block corresponds to a windowed segment
of the speech signal. These blocks of speech are transformed into a set
of transform coefficients; then, the coefficients are quantized
independently and transmitted. An inverse transform is taken at the
receiver to obtain the corresponding block of reconstructed samples of
speech (see Figure 19).

A basic assumption in this method is that the speech source is stationary
and has a variance of $\sigma^2$. The successive source output samples
are arranged into the N-vector X; this vector X is linearly transformed
using a unitary matrix A, i.e.,

Y  =  AX  (3.1) 

where  A,  in  general,  is  complex,  and 

AA*  =  I  (3.2) 

where * denotes the conjugate transpose. The elements of Y are the
transform coefficients. These are independently quantized, yielding
$\hat{Y}$. The vector $\hat{Y}$ is transmitted to the receiver and then
inverse transformed. Then

$\hat{X} = A^{*} \hat{Y}$  (3.3)

Since the vector $\hat{X}$ is the reconstructed output, distortion is
involved. For unitary matrices the averaged mean-squared overall
distortion of the transform coder is equal to the quantization error [82]

$D = \frac{1}{N} E\{(X - \hat{X})^T (X - \hat{X})\} = \frac{1}{N} E\{(Y - \hat{Y})^T (Y - \hat{Y})\}$  (3.4)

where E{ } represents the expectation. The minimization of D will yield
an optimum bit-assignment rule and an optimum transform matrix A [81].
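
The equality in (3.4) can be checked numerically. The following is a
minimal sketch under stated assumptions: the unitary matrix is an
orthonormal DCT (real, so the conjugate transpose is the ordinary
transpose), the quantizer is a plain uniform rounding quantizer, and the
input blocks are random numbers rather than speech. The names
`dct_matrix` and `uniform_quantize` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its rows are the basis vectors."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    a = np.sqrt(2.0 / n) * np.cos((2 * m + 1) * k * np.pi / (2 * n))
    a[0, :] /= np.sqrt(2.0)
    return a

def uniform_quantize(y, step):
    """Round every coefficient independently to a multiple of step."""
    return step * np.round(y / step)

rng = np.random.default_rng(0)
N = 8
A = dct_matrix(N)
X = rng.standard_normal((N, 1000))       # each column is one block X
Y = A @ X                                # Y = AX, as in (3.1)
Y_hat = uniform_quantize(Y, step=0.25)   # independently quantized
X_hat = A.T @ Y_hat                      # inverse transform, as in (3.3)

d_signal = np.mean((X - X_hat) ** 2)     # distortion measured on X
d_coeff = np.mean((Y - Y_hat) ** 2)      # quantization error measured on Y
```

Because A is unitary, `d_signal` and `d_coeff` agree to within rounding
error, which is the statement of (3.4).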

Let $J_i$ be the number of bits/sample needed for the coefficient $Y_i$
(an entry in the Y vector) of variance $\sigma_i^2$ so that the
mean-squared distortion $D_i = E[(Y_i - \hat{Y}_i)^2]$ is not exceeded.
Then [82]

$J_i = \delta + \frac{1}{2} \log_2 \left[ \frac{\sigma_i^2}{D_i} \right]$  (3.5)

where $\delta$ is a correction factor that takes into account the
performance of a practical quantizer. The optimum number of bits for the
quantizer can be obtained by minimizing the average distortion

$D = \frac{1}{N} \sum_{i=1}^{N} D_i$  (3.6)

with the constraint of a given average bit rate

$R = \frac{1}{N} \sum_{i=1}^{N} J_i = \text{constant}$  (3.7)

The optimum bit assignment is [81]

$J_i = R + \frac{1}{2} \log_2 \frac{\sigma_i^2}{\left[ \prod_{j=1}^{N} \sigma_j^2 \right]^{1/N}}$  bit/sample,  i = 1, 2, ..., N  (3.8)

The average distortion is found to be

$D = 2^{2\delta} \cdot 2^{-2R} \left[ \prod_{j=1}^{N} \sigma_j^2 \right]^{1/N}$  (3.9)
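
As a worked check of the bit-assignment rule, the sketch below evaluates
the optimum assignment and the resulting distortion for an arbitrary set
of coefficient variances. The variances, the average rate, and the
function names are illustrative assumptions, and the correction factor
is taken as zero. The formulas coded here are the standard optimum
bit-assignment expressions: bits equal to the average rate plus half the
base-2 logarithm of each variance relative to the geometric mean.

```python
import numpy as np

def optimum_bits(variances, avg_rate):
    """Bits per coefficient: R + 0.5*log2(variance / geometric mean)."""
    variances = np.asarray(variances, dtype=float)
    geo_mean = np.exp(np.mean(np.log(variances)))
    return avg_rate + 0.5 * np.log2(variances / geo_mean)

def average_distortion(variances, avg_rate, delta=0.0):
    """Distortion 2^(2*delta) * 2^(-2R) * geometric mean of variances."""
    variances = np.asarray(variances, dtype=float)
    geo_mean = np.exp(np.mean(np.log(variances)))
    return 2.0 ** (2 * delta) * 2.0 ** (-2 * avg_rate) * geo_mean

variances = [16.0, 4.0, 1.0, 0.25]   # hypothetical coefficient variances
R = 2.0                              # hypothetical average bits/sample

bits = optimum_bits(variances, R)
D = average_distortion(variances, R)

# Per-coefficient distortion implied by the assignment (delta = 0):
per_coeff = np.asarray(variances) * 2.0 ** (-2 * bits)
```

For these values the geometric mean of the variances is 2, the
assignment is 3.5, 2.5, 1.5, and 0.5 bits, and every coefficient ends up
with the same distortion, 0.125. The rule can return negative bit counts
for very small variances; a practical coder would have to clamp and
redistribute them, which is not shown here.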


In (3.9), the distortion introduced by the transform coding scheme
depends on the distribution of the variances; D is proportional to their
geometric mean. This leads to the solution for the matrix A. Let
$R_{xx}$ and $R_{yy}$ be the covariance matrices of X and Y; then

$\det R_{yy} \le \prod_{i=1}^{N} \sigma_i^2$  (3.10)

and for any unitary matrix A

$\det R_{xx} = \det R_{yy}$  (3.11)

In particular, the variances $\sigma_i^2$ of the $Y_i$ lie along the
diagonal of $R_{yy}$; then,

$\det R_{xx} = \prod_{i=1}^{N} \lambda_i$  (3.12)

where $\lambda_i$ are the eigenvalues of $R_{xx}$. Therefore, the minimum
distortion is found if the variances, $\sigma_i^2$, are equal to the
eigenvalues of $R_{xx}$ [81]. The Karhunen-Loeve transform (KLT) has the
property that $\sigma_i^2 = \lambda_i$ for all i.

Other  unique  properties  of  KLT  are:  (1)  transform  coefficients  are 
uncorrelated,  (2)  the  covariance  in  the  KLT  domain  is  diagonal,  and  there¬ 
fore,  the  transform  coefficients  can  be  quantized  independently  without 
the  loss  of  performance  [83]. 
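
These KLT properties can be verified on a small example. The sketch
below assumes a first-order Markov covariance matrix with correlation
0.9 as a stand-in for the source statistics; that model, the block
length, and the function name `klt` are assumptions made for
illustration only.

```python
import numpy as np

def klt(r_xx):
    """Return (A, eigenvalues): rows of A are eigenvectors of r_xx."""
    eigenvalues, eigenvectors = np.linalg.eigh(r_xx)
    order = np.argsort(eigenvalues)[::-1]        # largest variance first
    return eigenvectors[:, order].T, eigenvalues[order]

# Hypothetical source covariance: R_xx[i, j] = rho ** |i - j|
N, rho = 4, 0.9
idx = np.arange(N)
r_xx = rho ** np.abs(idx[:, None] - idx[None, :])

A, lam = klt(r_xx)
r_yy = A @ r_xx @ A.T                # covariance in the KLT domain

variances_klt = np.diag(r_yy)        # coefficient variances with the KLT
variances_identity = np.diag(r_xx)   # variances with no transform at all
```

Here `r_yy` is diagonal (the coefficients are uncorrelated), its
diagonal equals the eigenvalues of `r_xx`, and the product of the
variances equals the determinant. Without a transform the product of the
variances is larger than the determinant, so the geometric mean, and
with it the distortion, is larger.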

It has been noted that the KLT gives optimum performance; however, there
is no fast algorithm for the computation of the coefficients. In
addition, the computation is quite complex. Since speech is a
quasi-periodic signal, transform coding would not be efficient unless
adaptive methods are used. However, this area still needs additional
study. Zelinski and Noll presented promising results. Tribolet and others
[38] have also done additional work in this area. Zelinski and Noll
experimented with the Walsh-Hadamard transform (WHT), the discrete slant
transform (DST), the discrete Fourier transform (DFT), and the discrete
cosine transform (DCT) to compare with the KLT. All of these have fast
algorithms and are signal independent. Zelinski and Noll found that the
basis vectors of the DCT and KLT are close; however, the KLT is signal
dependent. It has been shown that the performances of the DCT and KLT
are similar [84]. The studies of Tribolet and others found TC to be
complex and costly; however, this method proves to be superior when
compared to other systems [38].

3.4  Sub-Band  Coding 

It  is  desired  to  retain  the  basic  components  of  speech  composition 
and  phonemic  quality.  TC  is  a  very  efficient  method  of  completing  the 
endeavor;  however,  due  to  cost  and  complexity,  it  was  discarded.  The 
method  of  sub-band  coding  [36]  has  some  very  distinct  advantages  whereby 
the  original  goal  can  be  met  in  order  to  secure  as  much  of  the  speech 
signal  as  possible.  One  criterion,  perceptual  in  nature,  is  the  reten¬ 
tion  of  transitional  information.  Also,  the  intelligibility  of  speech 
can  be  maximized  by  the  use  of  the  Articulation  Index  [29],  which  is 
discussed  in  Appendix  C. 

With sub-band coding the frequency spectrum is partitioned such that each
sub-band contributes accordingly to the speech intelligibility, which is
quantified by the Articulation Index. The Articulation Index is a
weighted fraction representing, for a given speech channel and noise
condition, the effective proportion of the normal speech signal which is
available to a listener for conveying speech intelligibility [86]. The
speech spectrum can be divided into 20 frequency bands contributing 5
percent each to the Articulation Index. The frequency spectrum can
therefore be bandpass filtered in such a way that the resulting sub-bands
contribute equally to the Articulation Index. An example given by
Crochiere and others [36], shown in Table IV, partitions the range from
200 to 3200 Hz into four sub-bands.


TABLE  IV 

SUB-BAND  PARTITIONING  EXAMPLE 


Sub-Band No.     Frequency Range (Hz)

     1               200 -  700
     2               700 - 1310
     3              1310 - 2020
     4              2020 - 3200

Obviously, there are other possibilities for partitioning the speech band
[37]. In this example, each band contributes an equal 20 percent to the
total Articulation Index. The total Articulation Index is 80 percent,
which corresponds to a word intelligibility of approximately 93 percent
[36].

Sub-band coding has another advantage, which involves quantization. Each
sub-band is quantized separately and each band contains its own
distortion; therefore, quantization noise can be considered separately
for each band [36]. Furthermore, because of the nature of the spectrum of
speech, the detectability of this distortion is not the same at all
frequencies.

Since the proposed method is based upon sub-band coding the residual,
the presentation is in terms of the prediction residual, $e_f(k)$.

For the following discussion, assume that the sub-bands are partitioned
as shown in Figure 20. Let the width of each of these bands be
identified by

$W_n = \omega_{n+1} - \omega_n$,  n = 1, 2, ..., N = 4  (3.13)

where $\omega_n$ corresponds to the edges of these bands. The
implementation of the sequence of operations leading from the residual
to the coded output for transmission is shown in Figure 21. Also shown
in Figure 21 is the implementation at the receiver. From this figure, it
follows that

$r(k) = (e_{fn}(k) \cos \omega_n k) * h_n(k)$  (3.14)

where $e_{fn}(k)$ corresponds to the output of the nth bandpass filter,
* denotes convolution, and $h_n(k)$ corresponds to the impulse response
of the nth lowpass filter. It is clear that

$W_n < 2\omega_n$  (3.15)

in order that the frequency bands are properly separated. Then r(k) is
decimated to the rate $2W_n$ from the original sampling frequency. This
signal is then encoded and multiplexed with the other channels. At the
receiver, the signal is demultiplexed, decoded, interpolated,
demodulated, and bandpass filtered to give $e_{fn}(k)$. This is shown in
Figure 21. The nth sub-band is then summed with the other bands to
produce the sub-band coded and decoded version of the signal. The total
implementation of the system will be discussed later.

Figure 20. Partitioning of Frequency Spectrum into Four Sub-Bands
(After Crochiere, 1976)
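
The analysis path for one sub-band (modulation by the lower band edge,
lowpass filtering, and decimation) can be sketched as follows. This is
an illustration under assumptions, not the filter design used in this
study: the sampling rate of 8000 Hz, the 101-tap Hamming-windowed sinc
filter, the single-tone test input, and the names `lowpass_fir` and
`analyze_band` are all hypothetical. The band edges are those of
sub-band 2 in Table IV.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc lowpass impulse response with unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs_hz * n) * np.hamming(num_taps)
    return h / np.sum(h)

def analyze_band(e_fn, f_low_hz, f_high_hz, fs_hz):
    """Modulate by the lower band edge, lowpass to the band width, decimate."""
    k = np.arange(len(e_fn))
    omega_n = 2.0 * np.pi * f_low_hz / fs_hz     # lower band edge, rad/sample
    width_hz = f_high_hz - f_low_hz              # band width W_n in Hz
    modulated = e_fn * np.cos(omega_n * k)
    r = np.convolve(modulated, lowpass_fir(width_hz, fs_hz), mode="same")
    step = int(fs_hz // (2 * width_hz))          # keeps the rate >= 2 W_n
    return r, r[::step], step

fs = 8000                                        # assumed sampling rate, Hz
k = np.arange(1600)
tone = np.cos(2.0 * np.pi * 900.0 * k / fs)      # test tone inside 700-1310 Hz

r, r_decimated, step = analyze_band(tone, 700.0, 1310.0, fs)
```

The 900 Hz tone is shifted down to 200 Hz with half its original
amplitude, and the image at 1600 Hz is removed by the lowpass filter.
The decimation factor here is 6, giving a sub-band sampling rate of
about 1333 Hz, slightly above twice the 610 Hz band width.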


3.4.1  Sub-Band  Coding  and  Transform  Coding 


Earlier it was pointed out that frequency-domain coders can be
considered a good basis for an efficient coder. In this section the
relationship between sub-band coding and transform coding is discussed.

Considering the ideal case, in which there are M sub-bands corresponding
to the M samples, let the discrete cosine transform (DCT) of the
residual signal, $e_f(k)$, k = 0, ..., M-1, be represented by [84]

Correspondingly, the residual signal $e_f(k)$ is given by

<Mk>  ■  "k'  cos  X24irM  k  =  0-’ . 

*  n  n=  I 


(3.17) 

which corresponds to the inverse discrete cosine transform (IDCT). Using

in Figure 21, it is seen after modulation and low-pass filtering

$r(k) = \alpha_n$,  k = 0, 1, ..., M-1  (3.19)

Since there are M sub-bands corresponding to the M frequencies, and
since r(k) is a constant, it follows that after the decimation only one
point is given for each band and that value is $\alpha_n$. The encoder in
Figure 21 codes the DCT coefficient. This points out the fact that in
the ideal case (i.e., filters and modulators are ideal), sub-band coding
is equivalent to discrete cosine transform coding. Obviously, the
discussion above can be generalized for the case wherein there are N
sub-bands ($N \le M$) rather than M.

It is clear that where the components in the sub-band coder are
non-ideal, the r(k) are not equal to $\alpha_n$. Further work is
necessary in quantifying the difference between r(k) and $\alpha_n$ [85].

Noting  the  simplicity  in  the  sub-band  coder  and  also  noting  the  re¬ 
lationship  between  the  transform  coder  and  sub-band  coder,  the  sub-band 
coder  is  more  practical. 

3.5  Determination  of  Frequency  Sub-Bands 
Based  on  Articulation  Index 

The  Articulation  Index  (AI)  is  a  weighted  fraction  representing,  for 
a  given  speech  channel  and  noise  condition,  the  effective  proportion  of 
the  normal  speech  signal  which  is  available  to  a  listener  for  conveying 
speech  intelligibility  [29]. 

In this section, the methods of determining how to achieve maximum
intelligibility based on the AI are examined. There are two methods for
computing AI. The first method, called the 20-band method by French and
Steinberg [86], is based on measurements or estimates of the spectrum of
the speech and noise present in each of 20 contiguous bands of
frequencies. Each band contributes equally to speech intelligibility.
The second method is derived from the first method. It requires
measurements of the speech and noise present either in certain
one-third-octave bands or in certain full octave bands.

Some researchers consider these two, i.e., one-third-octave-band and
full octave-band measurements, as different methods. The octave-band
method is not as sensitive to variations in the speech and noise spectra
as the 20-band or the one-third-octave-band method. An example where it
breaks down is a situation in which an appreciable fraction of the
energy of the masking noise is concentrated in a band of frequencies
that is one octave or less in width; under these conditions, the
one-third-octave-band or the 20-band method would be better to use.

The 20 frequency bands are those specified by Beranek for male voices
[87]. These bands are shown in Table XXIV in Appendix C. In order to use
the 20-band method to calculate the AI, the peaks of the spectrum of the
speech signal (PSS) must be approximated first. The level depends on
whether the speech is presented through earphones or a loudspeaker. In
either case there is an adjustment of -65 dB, which is considered the
over-all long-term rms sound-pressure level of an idealized speech
spectrum. However, with the loudspeaker, an additional amount is
adjusted according to Table V [29]. This is due to the assumption that
the room is semireverberant, whereas earphones do not present
reverberance.

These corrections are obtained from experiments conducted in a
reverberant room using a loudspeaker and from experiments conducted in
an anechoic chamber [29].

An additional correction must also be added to correct for the noise
spectrum. This is shown in Table VI [29]. The noise that reaches the
listener's ear is assumed to be of a steady-state nature. All noises in
the listener's environment and the noise in transmission systems are
combined to arrive at the noise spectrum level.


TABLE  V 

ADJUSTMENTS  TO  THE  SPECTRUM  OF  THE  SPEECH  SIGNAL 


Maximum Spectral Values       Amount to be
of Speech Signal (dB)         Subtracted (dB)

         85                        0
         90                        2
         95                        4
        100                        7
        105                       11
        110                       15
        115                       19
        120                       23
        125                       27
        130                       30

The corrected noise spectrum (NS) has the effect of masking the speech
signal. The noise spectrum increases at a faster than normal rate when
the band sensation level of the speech sound exceeds 80 dB [86]. This
band sensation level is defined as the difference in decibels between
the sound integrated over a frequency band and the sound pressure level
of that band when the speech sound is at the threshold of audibility in
an anechoic room. The increase in masking is taken into account in the
calculation of AI by adding to the PSS. If the band sensation level of
the sound exceeds 80 dB at the center frequency of a band, then the PSS
is increased by the amount that is shown in Table VI.


TABLE  VI 

ADJUSTMENTS  FOR  NOISE  SPECTRUM 


Band Sensation        Added
Level (dB)            Amount (dB)

      80                  0
      85                  1
      90                  2
      95                  3
     100                  4
     105                  5
     110                  6
     115                  7
     120                  8
     125                  9
     130                 10
     135                 11
     140                 12
     145                 13
     150                 14


The noise spectrum level (NS) is compared to PSS at the mid-frequencies
of the 20 bands given in Table XXIV in Appendix C. Differences that are
zero or less are set to zero. When PSS exceeds the noise by more than
30 dB, the difference is set to 30. This is due to the limitation on the
dynamic range of speech [87].

The Articulation Index is defined as

$AI = \sum_{n=1}^{20} W_n \cdot (\Delta A)_{max}$  (3.20)

where $(\Delta A)_{max}$ is the contribution from one band and has a
maximum value of 0.05, and $W_n$ is the fraction of the maximum
contribution supplied by band n,

$W_n = \frac{PSS - NS}{30}$  (3.21)

where 30 represents the dynamic range (in dB) of the speech band and
normalizes $W_n$ so that it is limited to unity. Therefore, for 20
bands, the overall normalization is 600. An illustrative example is
given by Kryter [29].
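
The 20-band calculation can be summarized in a few lines. The sketch
below assumes that the per-band PSS and NS levels (in dB, already
corrected according to Tables V and VI) are supplied by the caller; the
function name and the example levels are hypothetical and are not taken
from Kryter's example.

```python
import numpy as np

DELTA_A_MAX = 0.05          # maximum contribution of one band
DYNAMIC_RANGE_DB = 30.0     # dynamic range of speech in a band

def articulation_index(pss_db, ns_db):
    """AI by the 20-band method from per-band PSS and NS levels in dB."""
    pss = np.asarray(pss_db, dtype=float)
    ns = np.asarray(ns_db, dtype=float)
    if pss.shape != (20,) or ns.shape != (20,):
        raise ValueError("the 20-band method needs 20 values for PSS and NS")
    # Differences at or below zero become 0; differences above 30 dB become 30.
    diff = np.clip(pss - ns, 0.0, DYNAMIC_RANGE_DB)
    weights = diff / DYNAMIC_RANGE_DB            # W_n, limited to unity
    return float(np.sum(weights * DELTA_A_MAX))

# Hypothetical levels: speech peaks at 50 dB in every band.
pss = np.full(20, 50.0)
ai_quiet = articulation_index(pss, np.full(20, 10.0))   # 40 dB, clipped to 30
ai_half = articulation_index(pss, np.full(20, 35.0))    # 15 dB above the noise
ai_masked = articulation_index(pss, np.full(20, 60.0))  # noise above speech
```

With these inputs the index is 1.0 when the speech peaks exceed the
noise by 30 dB or more in every band, 0.5 when they exceed it by 15 dB,
and 0.0 when the noise lies above the speech peaks everywhere.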

Consider the one-third-octave-band and octave-band methods. The center
and cut-off frequencies for these are shown in Tables VII and VIII [29].

With the one-third-octave and octave-band methods, the correction levels
shown in Table V should be considered for signals received from the
loudspeaker. Also, the NS must be calculated from Table VI, and the
weighting factors need to be computed from (3.21) for each band. These
are then summed to give the AI for the speech system operating under the
given noise conditions and level of speech.


TABLE  VII 

FREQUENCIES  RELATED  TO  ONE-THIRD-OCTAVE-BAND  METHOD 


Third-Octave Band (Hz)      Center Frequency (Hz)

     179 -  224                   200
     224 -  280                   250
     280 -  353                   315
     353 -  448                   400
     448 -  560                   500
     560 -  706                   630
     706 -  896                   800
     896 - 1120                  1000
    1120 - 1400                  1250
    1400 - 1790                  1600
    1790 - 2240                  2000
    2240 - 2800                  2500
    2800 - 3530                  3150
    3530 - 4480                  4000
    4480 - 5600                  5000
TABLE VIII

FREQUENCIES RELATED TO OCTAVE METHOD


Octave Band (Hz)      Center Frequency (Hz)

   180 -  355               250
   355 -  710               500
   710 - 1400              1000
  1400 - 2800              2000
  2800 - 5000              4000

To  consider  how  the  different  methods  compare  for  the  same  speech 
signal  and  masking  noise,  Kryter  computed  the  AI  for  each  of  these  meth¬ 
ods.  For  20-band  method,  AI  =  0.38;  for  one-third-octave  method,  AI  = 
0.33;  and  for  octave-band  method,  AI  =  0.28.  Since  the  20-band  method  is 
the  basic  method  from  which  all  others  are  derived,  it  provides  the 
"correct"  AI  and  the  others  are  compared  to  this  AI. 

The AI can be compared to estimated speech intelligibility scores, as
shown by the graph in Figure 22. It is noted that the intelligibility
score is highly dependent on the constraints placed on the message
communicated. The greater the constraint (for instance, the smaller the
amount of information content in each item of the message), the higher
the percent intelligibility score for a given AI. No single AI can be
used as a criterion for an acceptable communication value. It is a
function of the messages transmitted and the enunciation of the talker
[29].
Figure 22. Relation Between AI and Various Measures of Speech
Intelligibility (After French and Steinberg, 1947)

The Articulation Index is a good quantitative measure of speech
intelligibility. Speech communication can be enhanced by an equal
application of AI across the speech spectrum in the sub-band coding. In
the next section, the transitional cueing of phonemes, another aid that
adds to speech communication, is discussed.

3.6  Transitional  Information 

The human speaks in an uninterrupted and continuous fashion to
communicate thoughts. The underlying basis for communication is the
phonemic structure, which is connected by means of transitional cues for
the perception of certain phonemes [1]. It is the transitional
information that must be enhanced to aid the perception needed for
absolute discrimination of speech-like sounds [2]. Transitional cues are
a set of frequency shifts which occur in the second-formant region where
a consonant and a vowel join. The perception of a given phoneme is
strongly conditioned by the transitional information of its neighbors
[2].

The identification of phonemes has been studied under various conditions
by a group at the Haskins Laboratories [1]. Many of their experiments
have used synthetic syllables. The combinations of syllables included
consonant-vowel (CV) syllables. The consonant is usually a stop out of a
group of phonemes with the same voicing. The vowels were maintained at
two formants. Further work has been done by Rabiner [88] on the
synthesis of phonemes by rule. These studies concluded that one
frequency variable of the consonant was generally adequate to
distinguish that a consonant of the group was uttered. To further
distinguish the consonant, the stop-vowel formant transitions were
necessary.

Figures 23(a) and 23(b) illustrate the stop-vowel formant transitions.
In Figure 23(a), the vowel /a/ has first and second formants occurring
at 700 Hz and 1200 Hz, respectively. It is seen that the consonants /b/,
/d/ and /g/ demonstrate a different rise or fall in the second formant
region. The second formant varies because each consonant has a different
place of articulation. The places of articulation for /b/, /d/ and /g/
are front, middle and back, respectively. It is seen in Figure 23(a)
that the consonants appear to commence from some trajectory determined
by their place of articulation.

The trajectory point is further illustrated in Figure 23(b). This figure
uses the consonant /d/ and three vowels, /a/, /i/ and /u/. It is shown
that the consonant /d/ has a locus of points that commence in the region
of 1600 Hz for the second formant. It has been shown that consonants
exhibit this property of transition from a particular frequency to the
steady-state value of the vowels [1].

The consonants that are heard with falling second formants to the vowel
/a/ are /d/ and /g/. The consonant /b/ is heard with a rising second
formant to the vowel /a/. It is noted that a shift in second formant
frequency is bounded. With falling transitions of the second formant,
/g/ is heard for steady-state levels of frequency between 2280 and 3000
Hz; however, between 1320 and 2280 Hz the sound could be /g/ or /d/;
and, below 1320 Hz, it is identified as /d/ [1].
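
The frequency bounds quoted above amount to a simple decision rule for
falling second-formant transitions. The sketch below only restates that
rule in code form; the function name is hypothetical, and the handling
of values that fall exactly on a boundary or above 3000 Hz is an
assumption, since the text does not specify it.

```python
def stops_for_falling_f2(steady_state_hz):
    """Candidate stop consonants for a falling second-formant transition.

    Returns a set of phoneme labels. An empty set means the steady-state
    frequency lies outside the range covered by the reported bounds.
    """
    if steady_state_hz < 1320.0:
        return {"d"}
    if steady_state_hz < 2280.0:
        return {"g", "d"}        # ambiguous region
    if steady_state_hz <= 3000.0:
        return {"g"}
    return set()                 # assumption: no decision above 3000 Hz

candidates_low = stops_for_falling_f2(1200.0)
candidates_mid = stops_for_falling_f2(1800.0)
candidates_high = stops_for_falling_f2(2500.0)
```

For steady-state values of 1200, 1800, and 2500 Hz the rule yields /d/,
the ambiguous pair /g/ or /d/, and /g/, respectively.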

The importance of second-formant transitions is thus shown for
perceptual purposes. Differences in the acoustic speech signal are due
to the excitation and vocal tract configuration for different
consonants. These will differ as shown in Figure 23 by the transition
regions. The consonant transitions are the principal cues for the
perception of a particular consonant. The transition occurs because the
vocal tract has one shape for the vowel and another shape for the
consonant. The change in the vocal tract and the effect that the glottal
pulse has on vowels have been addressed recently [12]. Later, the coding
aspects of transitional information will be discussed.

3.7  Relation  of  Perception 
to  Intelligible  Speech 

A topic that has been mentioned several times before is perception.
Perception ties together the Articulation Index and transitional
information in the discrimination of speech sounds. A quantitative
description of speech perception is not possible. However, in a
qualitative sense, speech perception can be enhanced when the
intelligibility of speech is increased. In this section, several aspects
of speech perception will be discussed to show the need to address this
subject.

Speech perception can be defined as the ability of humans to
discriminate and differentiate the character of speech sounds.
Discrimination is examined along fundamental dimensions of the hearing
mechanism and, in general, one dimension at a time. The ear takes
measurements and makes differential comparisons. These comparisons may
be of frequency and intensity. The over-learned senses of the brain
distinguish speech from other periodic waves. Further, the speech must
be broken into its discrete elements, the phonemes. Once the signal is
perceived as speech, there are other factors that determine the
fundamental characteristics of recognizing intelligible speech.

The ability to recognize and understand speech determines
intelligibility. The intelligibility of speech may be affected in
several ways [86]. These may include echoes, phase distortion, or
reverberation. Unnatural sounding speech can influence the understanding
of speech sounds. The intensity of the speech may affect the
intelligibility of speech received by the ear. Noise in a transmission
medium may affect intelligibility by masking the speech. Several factors
involving the talker and listener can cause unacceptable intelligibility
of the speech [86]. These are given below:

a.  The  basic  characteristics  of  the  speech  can  be  destroyed. 

b.  The  electrical  and  acoustic  instruments  which  operate  between 
the  talker  and  the  listener  may  not  be  adequate. 

c.  The  condition  under  which  the  communication  takes  place  may  not 
be  acceptable. 

d.  As  a  result  of  c.,  the  behavior  of  the  talker  and  listener  may 
be  modified  by  the  characteristics  of  the  communication  system. 

The perception of intelligible speech is related to the amount of
information spoken. This is shown in Figure 22. The exactness with which
the listener identifies speech sounds is related to the size of the
vocabulary and the sequence or context of the message. As seen from
Figure 22, the more predictable the message is, the better the
intelligibility. It has been shown that as the vocabulary size
increases, a higher signal-to-noise ratio is necessary to maintain
performance [2].

Perceptual aspects of speech are influenced greatly by semantics
and context. The ability to predict the speech utterance enhances
intelligibility. The grammatical rules of a language are part of the human
over-learned senses [65]. Consequently, the language prescribes a
certain allowable sequence of words. The semantic factors occur as part
of the rules because certain words must be associated with meaningful
units [66]. It has been shown that intelligibility of speech is
substantially higher when a grammatically correct and meaningful sentence is
spoken than when using the same words randomly [65]. The over-learned
senses reduce the number of alternative words from the context, and
therefore, the listener has improved intelligibility.

The application of speech perception is an adaptive process. The
listener uses the detection procedure within the reception system of the
ear to determine the speech communication process. The listener can
absolutely identify speech when given the basic sound elements of the
speech. The sound elements are discriminated and differentiated from
other periodic sounds to perceive speech. If the speech is intelligible,
the exactness is related not only to how good the transmission medium is
but also to the length of the utterance and its context. These concepts
are applied in the next section to develop a coding algorithm for
transmission of perceptually enhanced speech.

3.8  Basis of Coding the Prediction Residual

A coding method is presented to perceptually enhance the speech.
The method uses sub-band coding (SBC) for coding the prediction residual.
Besides being conceptually simple, SBC has the additional advantage
that each sub-band is quantized separately and each band contains its
own distortion. It should be pointed out that the input to the sub-band
coder is the residual signal rather than the speech signal. Some of the
reasons for this approach are:

a.  A more efficient bit distribution based on energy/frame,

b.  More pronounced pitch information in the residual signal, and

c.  An ideal input for the synthesizer at the receiver.

In an earlier section, the advantages of using the Articulation
Index in SBC have been discussed. Each sub-band is selected such that each
contributes equally to the Articulation Index [36]. However, it has been
shown that "satisfactory" performance can be expected if this equal
contribution to the articulation criterion can be met within a factor of
two [37] [87]. This relaxation of the criteria was allowed for
integer-band sampling with good results [36] [37]. That is, the sub-bands are
between $m_i f_i$ and $(m_i + 1) f_i$, where $m_i$ is an integer and $f_i$
is the bandwidth of the sub-band. The method is
popular because it eliminates the need for modulators. Even though
the integer-sampling method requires less hardware, the selection of
sub-bands using the articulation criteria would give better perception. There
has been some research done in the selection of the sub-bands by this
method [37]. Also, it should be pointed out that the sub-band selection
depends on the multiplexing of the encoded speech [37]. This subject
will be further discussed in Chapter IV.

The coding scheme of the residual is based on enhanced transitional
cues. It has been shown that the second formant is important for
perceptual purposes. The exact development will be discussed in this section.

The spectrum of the signal is used for calculation of the energy.
The energy can be represented by [108]

$$E = \frac{1}{N} \sum_{k=0}^{N-1} |E_f(k)|^2 \qquad (3.22)$$

where $E_f(k)$ corresponds to the discrete Fourier transform (DFT)
coefficients of the signal $e(k)$, which can be computed by using the fast
Fourier transform (FFT) algorithm.

Equation (3.22) is applied to the prediction residual to compute the
energy. The spectrum of the prediction residual is partitioned into four
sub-bands as stated before. Using (3.22), the energies in each sub-band
can be expressed by

$$E_n = \frac{1}{N} \sum_{k} |E_{fn}(k)|^2, \qquad n = 1, 2, 3, 4 \qquad (3.23)$$

where $E_{fn}(k)$ is the DFT coefficient of the signal corresponding to the
nth sub-band.

Now the total energy can be expressed by

$$E_T = \sum_{n=1}^{4} E_n \qquad (3.24)$$

Among speech sounds, $E_T$ has wide variance. Previous researchers have not
studied the variations in $E_T$ of the speech sounds for each prediction
residual. This aspect is discussed in the next section.
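The energy computation in (3.22) through (3.24) can be illustrated with a short sketch. It is not the thesis program ENERGY; the function names, the 64-point frame, the test tone, and the band edges (borrowed from the cutoff frequencies listed later in Table XVII) are assumptions made only for illustration, and a naive DFT stands in for the FFT.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT; an FFT would return the same coefficients faster."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def band_energy(spectrum, lo_hz, hi_hz, fs):
    """(1/N) * sum of |E_f(k)|^2 over bins whose frequency lies in [lo_hz, hi_hz)."""
    n = len(spectrum)
    total = 0.0
    for k, coeff in enumerate(spectrum):
        f = k * fs / n
        f = min(f, fs - f)  # fold the mirrored upper half of a real signal's spectrum
        if lo_hz <= f < hi_hz:
            total += abs(coeff) ** 2
    return total / n

FS = 8000                     # sampling rate in Hz, as used for the thesis data
N = 64                        # hypothetical frame length for this example
BANDS = [(250, 500), (500, 1000), (1000, 1700), (2000, 3000)]  # assumed edges

# Hypothetical test frame: a 750 Hz tone, which falls in the second band.
frame = [math.cos(2 * math.pi * 750 * t / FS) for t in range(N)]
spectrum = dft(frame)
energies = [band_energy(spectrum, lo, hi, FS) for lo, hi in BANDS]  # E_n
total_energy = sum(energies)                                        # E_T
```

For this tone essentially all of the energy lands in the second band, and the total agrees with the time-domain sum of squares, as Parseval's relation requires.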

3.8.1  Energy  Distribution 

This  section  gives  the  results  on  the  energy  data  for  phonemes.  The 
goal  of  the  energy  study  is  to  distinguish  between  vowels,  nasals  and 
noisy  sounds.  This  data  is  used  in  the  next  chapter  to  determine  the  bit 
distribution  in  the  coding  algorithm. 

The phonemic data used in this thesis was obtained from recordings
of a number of monosyllabic utterances of a male talker made in an
anechoic chamber. These utterances were lowpass filtered to 3600 Hertz.
The lowpass-filtered signal was then digitized at 8000 Hertz using the
program DIGITIZ. The digitized data is stored on the INTERDATA computer
system disk in data file BURGE.DAT.

For future use, the sentence data in digitized format [146] is
stored on the IBM 370 computer system. The data was lowpass filtered to
4000 Hertz and sampled at 8000 Hertz. This data is stored in the files
listed in Table IX with a description of the data. Representative
sonagrams of Table IX are shown in Appendix D.


TABLE IX
SENTENCE DATA

Sentence Description                            File

"The pipe began to rust while new"              OSU.ACT10161.SPEECH1
"Add the sum to the product of these three"     OSU.ACT10161.SPEECH2
"Open the crate but don't break the glass"      OSU.ACT10161.SPEECH3
"Oak is strong and also gives shade"            OSU.ACT10161.SPEECH4
"Thieves who rob friends deserve jail"          OSU.ACT10161.SPEECH5
"Cats and dogs each hate the other"             OSU.ACT10161.SPEECH6


The phonemic utterances used in this thesis are shown in Table X.
Table X represents a wide variety of speech sounds. The consonants /b/
and /h/ are used to utter syllables of the form consonant-vowel-consonant
(CVC) with the consonant /d/ in the final position for the vowels, such
as /hid/. The vowel /a/ is used to utter nonsense syllables of the form
vowel-consonant-vowel (VCV) in both initial and final positions, such as
/aba/. A set of minimal units using the final form -/Il/ (-ill) is used
for the consonants also. Some of the other syllables used are English
words. The basic sounds are found in Table II.


TABLE X
PHONEMIC DATA

No.  Utterance    No.  Utterance    No.  Utterance    No.  Utterance

  1  ..            41  /y/           81  HI           121  /hid/
  2  /i/           42  /y/           82  /bit/        122  /hid/
  3  /i/           43  /ml           83  /bit/        123  /hed/
  4  /I/           44  /ml           84  /bet/        124  /haed/
  5  III           45  IN            85  /baft/       125  /hAd/
  6  IN            46  IN            86  /bAt/        126  /hDd/
  7  IN            47  /N            87  NON          127  /hUd/
  8  /*/           48  IN            88  /bUt/        128  /hud/
  9  /*/           49  IN            89  /fut/        129  /h^d/
 10  IN            50  /b/           90  /but         130  /hald/
 11  IN            51  IN            91  /bjh/        131  /hDId/
 12  /a/           52  IN            92  /als/        132  /haUd/
 13  /a/           53  IN            93  IbDl/        133  /hoUd/
 14  ID/           54  IN            94  /baU/        134  /held/
 15  Id/           55  /Pi           95  /boU/        135  /hjud/
 16  IN            56  IN            96  /belt/       136  /awa/
 17  / u/          57  IN            97  /ju/         137  /ala/
 18  lul           58  --            98  /wll/        138  /ara/
 19  IN            59  IN            99  /111/        139  /aya/
 20  in            60  IN           100  /HI/         140  /ama/
 21  in            61  IN           101  /yii/        141  /ana/
 22  /all          62  IN           102  /mil/        142  /sen/
 23  --            63  IN           103  /nil/        143  /aba/
 24  /all          64  HI           104  /sen/        144  /ada/
 25  IDU           65  IN           105  /bll/        145  /aga/
 26  nu            66  INI          106  /dll/        146  /apa/
 27  laU/          67  INI          107  /gii/        147  /ata/
 28  laU/          68  hi           108  / pH/        148  /aka/
 29  loW           69  IN           109  /til/        149  /aha/
 30  /oU/          70  IN           110  /kll/        150  /aja/
 31  /el/          71  IN           111  /hll/        151  /at/a/
 32  /el/          72  IN           112  /jll/        152  /ava/
 33  /jU/          73  /z/          113  /t/Il/       153  /a#a/
 34  /ju/          74  IN           114  /vll/        154  /aza/
 35  /w/           75  IN           115  /Net/        155  /afa/
 36  /W/           76  /e/          116  /all/        156  /a8a/
 37  /!/           77  /0/          117  /fll/        157  /asa/
 38  /I/           78  Is/          118  /ba £*6/     158  /a/a/
 39  /r/           79  Is/          119  /sll/
 40  /r/           80  III          120  //II/

The phonemes are analyzed by the algorithms in Appendix B. The
energy data is shown in Table XI, normalized by the sound /æ/ for the
first 81 phonemes in Table X. The energy in the phoneme /æ/ is the
largest among the phonemes. The data is calculated by the program ENERGY.
From Table XI, it can be seen that the
energy of the prediction residual divides the phonemes into classes by
phonemic aggregations.

It is well known that with simple LPC methods [60], the excitation
function is a set of periodic pulses or random noises which can be
identified as high or low energy excitation functions. However, by using the
energy data in Table XI, the phonemes can be grouped into three classes,
namely high energy, low energy and noise groups. The high energy group
includes the vowels and diphthongs. The plosive, fricative and unvoiced
phonemes make up the noise group. The low energy group is composed of
glides and nasals. It follows that an ideal excitation signal for speech
would enhance perception by considering a three-tier classification
rather than the conventional two-source model. This would include a
source for vowels, a source for nasals and glides, and a source for
fricatives. This is the result of the phoneme energy study of the
prediction residual. A normalized energy distribution by phoneme for each
sub-band is shown along with the energy bands in Figure 24.


TABLE XI

ENERGY BY PHONEME FOR PREDICTION RESIDUAL

                           Frequency Band
Phoneme    Total     SB1      SB2      SB3      SB4

/i/         .44      .58      .46      .33      .31
/I/         .75      .51      .46      .75      .45
/ɛ/         .84      .65      .47     1.00      .54
/æ/        1.00     1.00     1.00     1.00     1.00
/ʌ/         .72      .67      .57      .45      .41
/a/         .72      .64      .69      .59      .35
/ɔ/         .83      .60      .68      .70      .42
/U/         .24      .23      .25      .20      .15
/u/         .19      .29      .10      .11      .20
/ɝ/         .61      .62      .22      .64      .15
/aI/       1.00     1.00      .79      .68      .65
/ɔI/        .44      .75      .45      .20      .32
/aU/       1.00     1.00      .95      .90      .78
/oU/        .56     1.00      .31      .21      .53
/eI/        .86     1.00      .64      .66      .49
/jU/        .32      .67      .21      .22      .12
/w/         .24      .35      .24      .10      .23
/l/         .24      .29      .08      .10      .29
/r/         .14      .24      .12      .09      .07
/y/         .11      .20      .08      .08      .07
/m/         .34      .65      .25      .22      .19
/n/         .22      .45      .17      .18      .13
/ŋ/         .37      .67      .37      .24      .18
/b/         .24      .49      .11      .14      .17
/d/         .32      .63      .27      .14      .21
/g/         .31      .50      .18      .14      .16
/p/         .18      .27      .08      .07      .08
/t/         .45      .46      .32      .26      .25
/k/         .32      .63      .23      .19      .20
/h/         .45      .46      .24      .31      .31
/j/         .53      .51      .44      .31      .58
/tʃ/        .23      .46      .16      .10      .11
/v/         .16      .29      .13      .09      .09
/ð/         .17      .32      .13      .12      .10
/z/         .24      .44      .17      .19      .15
/f/         .07      .04      .07      .05      .12
/θ/         .11      .21      .09      .07      .05
/s/         .08      .05      .06      .06      .13
/ʃ/         .10      .06      .07      .07      .18

Based on the above discussion, phonemes can be classified into three
energy groups: (1) high energy (HE), (2) low energy (LE) and (3) noise
(N). To do this, the normalized residual phoneme energies (second column
in Table XI) are first tabulated; from this tabulation, there are clear
breaks in the energy levels, and therefore three energy groups are formed.
These breaks are used to identify the threshold values for a particular
energy group. For the high energy group, let $T_{11}$ be the threshold
value. That is, any phoneme that has normalized residual energy greater
than $T_{11}$ is classified into the high energy group. Similarly,
$T_{22}$ and $T_{33}$ are the established threshold values for low energy
and noise phonemes, respectively. The three groupings are given in
Table XII. The threshold values $T_{ii}$, i = 1, 2, 3, can be identified
from Figure 24. These are for the entire frequency range.
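As a minimal sketch of the threshold idea described above, the fragment below sorts a normalized total energy into one of the three groups. The two break values are hypothetical placeholders, since the actual thresholds are read from Figure 24 and are not reproduced in the text; the function name and the two-break simplification are likewise assumptions. The three sample energies are the "Total" entries for /æ/, /m/ and /f/ in Table XI.

```python
def classify_energy_group(energy, t_high, t_low):
    """Return 'HE', 'LE' or 'N' for a normalized residual energy.

    t_high and t_low are assumed break points between the groups;
    they are illustrative, not values taken from the thesis.
    """
    if energy > t_high:
        return "HE"   # high energy group
    if energy > t_low:
        return "LE"   # low energy group
    return "N"        # noise group

T_HIGH, T_LOW = 0.40, 0.20  # hypothetical break values

# Normalized total energies for three phonemes, taken from Table XI.
samples = {"ae": 1.00, "m": 0.34, "f": 0.07}
groups = {name: classify_energy_group(e, T_HIGH, T_LOW)
          for name, e in samples.items()}
```

A single pair of breaks on the total energy is only a first approximation; the per-band thresholds developed next refine the decision for each sub-band.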


TABLE XII

PHONEME ENERGY GROUPINGS


Energy Groups    Phonemes

HE               i, I, ɛ, æ, a, ʌ, ɔ, U, u, ɝ

LE               m, n, ŋ, z, w, l, r, y

N                ʃ, f, b, d, g, p, t, k


For the sub-band coding, threshold values need to be computed for
each band. Also, each energy group has to be divided into four subgroups
corresponding to the four sub-bands. Let $E_{in}$ be the normalized signal
energy in the nth frequency band corresponding to the phoneme that is in
the ith energy group. This is explicitly shown in Table XIII. For
example, $E_{12}$ represents the energy in the second frequency band
corresponding to the high energy phoneme (first energy group).

The threshold values for $E_{in}$ (referred to hereafter as $E^{T}_{in}$)
in Table XIII will now be established using columns 3, 4, 5 and 6 in
Table XI.


TABLE XIII

SYMBOLIC REPRESENTATION OF ENERGY DISTRIBUTION

                Frequency Band
          1        2        3        4

H        E11      E12      E13      E14
L        E21      E22      E23      E24
N        E31      E32      E33      E34

$E_{in}$ is listed for various phonemes in columns 3, 4, 5 and 6 in
Table XI. To make the classification speaker independent, $E_{in}$ has
to be normalized by $E_T$ given in (3.24). Let

$$\bar{E}_{in} = \frac{E_{in}}{E_T}, \qquad i = 1, 2, 3; \quad n = 1, 2, 3, 4 \qquad (3.25)$$

From this, it is clear that

$$\bar{E}_{in} \le 1.0, \qquad i = 1, 2, 3; \quad n = 1, 2, 3, 4 \qquad (3.26)$$

As before, the $\bar{E}_{in}$ in (3.25) are tabulated for i = 1, 2, 3 and
n = 1, 2, 3, 4. The breaks are established from this tabulation and the
threshold values are obtained from these breaks. These are tabulated in
Table XIV. The array in Table XIV will be referred to hereafter as the
energy threshold matrix. This matrix will be used in computing the bit
allocation scheme, which is discussed in the next chapter.
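The normalization by the total energy can be sketched in a few lines. The function name and the sample energies are hypothetical; the only property taken from the text is that each sub-band energy is divided by the frame total, so that no normalized value exceeds 1.0.

```python
def normalize_by_total(band_energies):
    """Divide each sub-band energy by the frame total E_T."""
    total = sum(band_energies)
    if total == 0:
        # A silent frame has no energy to distribute.
        return [0.0 for _ in band_energies]
    return [e / total for e in band_energies]

# Hypothetical sub-band energies for one frame (arbitrary units).
normalized = normalize_by_total([2.0, 1.0, 0.5, 0.5])
```

Because the values are ratios of energies within the same frame, a louder or softer talker scales the numerator and denominator together, which is the sense in which the normalization reduces speaker dependence.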


TABLE XIV

ENERGY THRESHOLD MATRIX

                Frequency Band
          1        2        3        4


3.9  Summary 

In this chapter, the basis of coding the prediction residual at the
rate of 9600 bits/second using the techniques of sub-band coding was
presented. Transform coding and sub-band coding were discussed along with
their relationship. The method of achieving maximum intelligibility
based on the Articulation Index was presented. Transitional information
of speech along with the relation of speech perception to intelligibility
was discussed. Phonemes have been divided into three energy groups so
that these can be used in the bit allocation scheme to be discussed in
Chapter IV.


CHAPTER  IV 


ENERGY  BASED  SUB-BAND  CODING  ALGORITHM 

4.1  Introduction 


In this chapter the sub-band coding algorithm, introduced in Chapter
III, is examined with the prediction residual as the input source signal.
The coding algorithm combines spectral analysis and waveform coding
techniques. The combination is intended to provide perceptual enhancement of
the speech. The perceptual aspects of speech are a key factor in the bit
distribution of the coding algorithm. The bit allocation is established
by using the energy groups discussed in the last chapter. For each frame
and for each sub-band, the energy
$E_n = \frac{1}{N} \sum_{k} |E_{fn}(k)|^2$ is computed, where
$E_n$ indicates the energy corresponding to the nth sub-band in a given
frame.


It is well known that most of the spectral density for vocalic
sounds and the fundamental frequency are basically found in sub-band
number one (lowest frequency band). The intensity of the energy is
substantially high. Spectrogram data can show this. The second formant
resides predominantly within the second and third sub-bands and is of the
low energy type. These formants determine the transitional cues for
certain perceptual effects. The energy of noisy speech sounds, i.e.,
voiceless fricatives, plosives, etc., has a basically flat spectrum and most of
the energy is above 2 kHz. The perceptual effects are discerned in this
frequency range. The spectrograms show the intensity of the signal
energy represented by varying shades of gray or black areas [2]. The
higher the energy, the darker the area. Spectrograms are included in
Appendix D. These figures are included to show the different energy
levels associated with different phonemes. From these spectrograms, it
can be seen that vowels are typified by dark areas, whereas fricatives,
plosives, etc., are shown in a gray area. Although all voiced sounds
show a dark color on the spectrogram, Makhoul and Wolf [90] have shown
that nasals and glides have a lighter shade when compared to other voiced
sounds.

In this study, the energy in each frame of the prediction residual
is calculated for each type of phoneme. The bits per sample in each band
are allocated on an adaptive basis, using the perceptual criteria
discussed in the last chapter. The next section deals with the bit
allocation scheme.

The  bit  allocation  method  is  incorporated  into  the  sub-band  coder, 
which  is  discussed  in  Section  4.3.  The  adaptive  strategy  is  combined 
with  a  uniform  quantizer  with  results  presented  in  Sections  4.4  and  4.5. 
Section  4.6  gives  the  details  of  the  modules  for  computational  aspects 
of  the  coding  of  the  prediction  residual. 

4.2  Bit  Allocation 

In this section, the bit allocation scheme is discussed using the
energy groupings in Table XII in Chapter III. In symbolic form, the bit
distribution is shown in Table XV for a three-energy-level, four-sub-band
coder, where the rows correspond to the energy levels and the columns
correspond to a particular frequency band. For example, $k_{23}$ bits per
sample are assigned for the second energy level (LE) and the third
frequency band.


TABLE XV

SYMBOLIC REPRESENTATION OF BIT DISTRIBUTION

                            Frequency Band
                      1        2        3        4

High Energy (H)      k11      k12      k13      k14
Low Energy (L)       k21      k22      k23      k24
Noise (N)            k31      k32      k33      k34

The bits are allocated by the empirical formula

$$k_{ij} = \log_2\left(1 + \frac{E_{ij}}{\sigma_j}\right), \qquad i = 1, 2, 3; \quad j = 1, 2, 3, 4 \qquad (4.1)$$

where $E_{ij}$ is the energy from Table XIII and $\sigma_j$ is a
normalization factor determined from the constraint

$$\sum_{j=1}^{4} k_{ij} N_j = C, \qquad i = 1, 2, 3 \qquad (4.2)$$

with $N_j$, j = 1, 2, 3, 4, being the number of samples in each band after
decimation. The value of C is equal to the total number of bits/frame
minus the number of sync bits per frame. Combining (4.1) and (4.2), it
follows that

$$\sum_{j=1}^{4} N_j \left[\log_2\left(1 + \frac{E_{ij}}{\sigma_j}\right)\right] = C, \qquad i = 1, 2, 3 \qquad (4.3)$$

where the normalization factor, $\sigma_j$, can be determined from (4.3).
Equations (4.1), (4.2) and (4.3) define the algorithm.

The normalization factor is included to take into consideration the
perceptual aspects of the signal. It is used as a weighting factor for
transitional cueing. It has been shown that pitch, formant areas,
nasality and affrication are important for speech perception. Within the
speech spectrum, these characteristics occur in certain frequency ranges.
The power density of speech illustrates this point, and is
discussed below.

The speech power density spectrum is shown in Figure 25. It is
clear that most of the energy is below 1000 Hertz. It has been shown by
Miller and Nicely [91] that below 1000 Hz, voicing, nasality, and
affrication are predominant for determination of the phonemic content. It has
been pointed out that given a set of speech signals, a weight factor can
be derived when the speech is separated into sub-bands. When these
signals are coded properly, there is an advantage of distinguishing certain
perceptual effects such as voicing, nasality and affrication. The
perceptual effects can be used for calculation of bits for coding.

To compute the normalization factor properly for coding the residual
signal, a bit matrix is chosen. The bit distribution that is selected is
based on perceptual concepts. This matrix will be referred to as an a
priori bit matrix. In addition to perceptual concepts, the a priori bit
matrix is selected such that the bit rate is 9600 bits/second for the
sub-bands given in Table IV. The matrix is shown in Table XVI, where the
entries will be referred to as $\hat{k}_{ij}$ to denote the a priori values.


TABLE XVI

A PRIORI BIT MATRIX DISTRIBUTION

                            Frequency Band
                      1        2        3        4

High Energy    1      4        3        2        2
Low Energy     2      3        3        3        2
Noise          3      2        3        3        3

The a priori bit matrix is based on experimental results on
phonemes. A cursory inspection of Table XVI reveals that the perceptual
criteria are preserved. For example, on lower bands where pitch and
formant data must be preserved as accurately as possible, a large number of
bits per sample are used for encoding, whereas for upper bands where
fricatives and noisy sounds are predominant, fewer bits per sample are
used. Note that the same number of bits for each energy group is
allocated. Also, the a priori bit values ($\hat{k}_{ij}$) are used to compute
the normalization factor in (4.1).

When the energy of the speech sound is determined to be high enough,
the energy threshold introduced in Chapter III selects the energy matrix
(from Table XIII) and a priori bit values (from Table XVI). These are
used to calculate the normalization factor from (4.1), and

$$\sigma_j = \frac{E^{T}_{ij}}{2^{\hat{k}_{ij}} - 1}, \qquad i = 1, 2, 3; \quad j = 1, 2, 3, 4 \qquad (4.4)$$

where $E^{T}_{ij}$ is the energy obtained from the threshold matrix and
$\hat{k}_{ij}$ is obtained from the a priori bit matrix. Figure 26 gives
the distribution of $(1/\sigma_j)$ based upon (4.4).

Equation (4.1) can now be used to allocate the bits. It should be
pointed out that in using this equation, actual energy values of the
signal will be used rather than the threshold values. The following steps
are performed to allocate the bits.

1.  Spectral estimates are computed for each sub-band.

2.  The total energy in the frame for the entire frequency band is
computed.

3.  The $E_{ij}$'s are computed.

4.  The normalization factor, $\sigma_j$, is computed.

5.  The bits are allocated by

$$k_{ij} = \log_2\left(1 + \frac{E_{ij}}{\sigma_j}\right), \qquad i = 1, 2, 3; \quad j = 1, 2, 3, 4 \qquad (4.5)$$

where $E_{ij}$ is the energy in the jth sub-band corresponding to the ith
energy group and $\sigma_j$ is the normalization factor from (4.4).
Figure 27 gives the flow chart for the bit allocation scheme.
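The allocation steps can be sketched as follows. The a priori bit matrix is taken from Table XVI, but the threshold energies and the frame energies are hypothetical placeholders, because the entries of the energy threshold matrix are not reproduced in this text. The normalization factor is obtained by inverting the logarithmic allocation formula at the threshold energy, and the function names are assumptions for illustration only.

```python
import math

# A priori bit matrix from Table XVI (rows: HE, LE, N; columns: bands 1-4).
A_PRIORI_BITS = [[4, 3, 2, 2],
                 [3, 3, 3, 2],
                 [2, 3, 3, 3]]

def normalization_factors(threshold_energies, a_priori_bits):
    """sigma_j = E^T_ij / (2**k_hat_ij - 1) for one energy group."""
    return [e_t / (2 ** k_hat - 1)
            for e_t, k_hat in zip(threshold_energies, a_priori_bits)]

def allocate_bits(energies, sigmas):
    """k_ij = log2(1 + E_ij / sigma_j) for each sub-band j."""
    return [math.log2(1 + e / s) for e, s in zip(energies, sigmas)]

# Hypothetical threshold energies for the high energy group.
threshold_row = [0.60, 0.35, 0.20, 0.10]
sigmas = normalization_factors(threshold_row, A_PRIORI_BITS[0])

# At the threshold energies the formula returns the a priori bits.
bits_at_threshold = allocate_bits(threshold_row, sigmas)

# Hypothetical measured sub-band energies for one frame.
bits_for_frame = allocate_bits([0.90, 0.30, 0.25, 0.05], sigmas)
```

A band whose actual energy exceeds its threshold value receives more than the a priori allocation and a band below it receives less. How the fractional results are rounded to whole bits is not stated here, so the sketch leaves them as real numbers.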


[Figure: Distribution of Normalization Factor, plotted against sub-band j]

[Figure: Bit Allocation by Sub-Band]

Figure 27.  Flow Chart for Bit Allocation


Equation (4.5) has been simulated using the phonemes in Table XII.
The bits are averaged for each energy group. The results of the
simulations are shown in Figure 28 for each of the three energy groups.
Distinctly shown is a separation of the energy groups. Note that the low
energy group which contains the nasalic and glide sounds is shown to
separate the high energy and noise groups. This separation supports the
three-source theory of the residual signal.

Earlier, it was shown that the residual signal parallels glottal
excitation. The use of the residual signal for encoding the speech and
later exciting the speech synthesizer has several benefits. The bits are
minimized in the first and second sub-bands, reducing the necessary
transmission rate for these sub-bands. It is unnecessary to transmit twice as
many bits for sounds with nasalic, glide or liquid characteristics. On
the other hand, the discrimination from the noise is shown to be distinct.
A further benefit is that the perceptual criteria will be
enhanced in all sub-bands. Discrimination of sounds can thus be achieved
with a minimum bit allocation.

The bit distribution is shown by frame for each phoneme in Figure 29.
Again, it is shown that the perceptual criteria are preserved in that the
pitch- and formant-predominant phonemes receive a substantial bit
allocation, while fewer bits are allocated for fricative and plosive phonemes.
Noting that the total number of allowed bits per frame is constant, the
difference in bits per energy group is adjusted in the synthesis bits.
This is discussed in detail in the next section.

4.3  Sub-Band  Encoding  of  the  Prediction  Residual 

The bit allocation scheme, which reflects the perceptual aspects of
speech, is used in sub-band coding of the prediction residual. The sub-band coder
partitions the frequency band of the residual signal into four sub-bands
by using bandpass filters. The partitioning of the frequency bands
is shown in Figure 20. Each sub-band is low-pass translated, decimated
[by the Nyquist interval obtained from (3.15)], and encoded according to
the bit allocation scheme discussed above. It has been shown that
separate coding of each sub-band accomplishes the preferential perception
criteria for that band [37]. The decoding of each sub-band involves an
interpolation and translation back to the original band. The bands are
summed to arrive at an estimate of the original residual signal (see
Figure 21). This section describes the sub-band coding parameters, the
relation of the sub-bands to the Articulation Index and other perceptual
criteria discussed in this thesis.
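The encoding step itself is not spelled out in this section; the chapter introduction mentions only that the adaptive strategy is combined with a uniform quantizer. As a hedged illustration of what encoding a decimated sub-band sample with k bits could look like, the sketch below implements a generic mid-rise uniform quantizer. The function names, the peak amplitude parameter and the mid-rise choice are assumptions, not details taken from the thesis.

```python
def uniform_quantize(sample, bits, peak):
    """Map a sample in [-peak, peak] to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits
    step = 2.0 * peak / levels
    clipped = max(-peak, min(peak, sample))      # clamp out-of-range input
    return min(int((clipped + peak) / step), levels - 1)

def uniform_dequantize(index, bits, peak):
    """Return the mid-point of the quantization cell for a given code."""
    step = 2.0 * peak / (2 ** bits)
    return -peak + (index + 0.5) * step

# Example with a 2-bit allocation and a hypothetical peak amplitude of 1.0.
code = uniform_quantize(0.3, 2, 1.0)
decoded = uniform_dequantize(code, 2, 1.0)
```

With more bits per sample the step size shrinks and the reconstruction error falls, which is why the bands carrying pitch and formant information are given the larger allocations.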

The cutoff frequencies for the sub-band coder are shown in Table
XVII. The guideline established for selection of cutoff frequencies is
to represent an approximately equal contribution to the Articulation
Index. The bands shown in Table XVII represent enough of the important
frequencies such that intelligibility is preserved.


TABLE XVII

SUB-BAND CODER CUTOFF FREQUENCIES

Band      Cutoff Frequency (Hz)

 1           250 -  500
 2           500 - 1000
 3          1000 - 1700
 4          2000 - 3000

[Figure panel: Bit Distribution by Frame, for HE (vocalic), LE (nasal,
glide) and N (fricative) phonemes; horizontal axis: phoneme]

Figure 29.  Bit Distribution for Phonemes by Frame

The integer-band sampling scheme [37] was also analyzed at the
sampling rate of 8000 Hertz. The technique requires the ratio of upper to
lower cutoff frequencies of the sub-band to be $(m_i + 1)/m_i$, where
$m_i$ is an integer. These bands are related at the bit rate such that the
data can be synchronized when multiplexed. Table XVIII is helpful in
constructing the sub-bands. Previous authors have given the choice of bands
that relate at other sampling rates [36] [37]. Shown in Table XVIII are
integer-band sampling cutoff frequencies for an 8000 Hertz sampling rate.
The integer decimation ratio is shown in Column 1 for 8000 Hertz. The
bandwidths, $f_i$, are indicated in Column 2. The sampling rate, $2f_i$,
is shown in Column 3. In Columns 2, 3 and 4, the cutoff frequencies are
indicated implicitly. Integer-band sampling is not used in this thesis,
and is given here for completeness.
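The entries of an integer-band table follow directly from the decimation ratio, as the short sketch below shows. The function names are hypothetical; the relations used are only those stated in the text, namely a bandwidth equal to half the sampling rate divided by the decimation ratio, and band edges at consecutive integer multiples of that bandwidth.

```python
def integer_band_row(decimation_ratio, fs=8000, multiples=4):
    """Return [f_i, 2*f_i, ..., multiples*f_i] in Hz for one decimation ratio."""
    f_i = (fs / 2) / decimation_ratio    # bandwidth of the sub-band
    return [m * f_i for m in range(1, multiples + 1)]

def integer_band_edges(m, f_i):
    """Lower and upper cutoff of the band between m*f_i and (m + 1)*f_i."""
    return (m * f_i, (m + 1) * f_i)

row_for_ratio_8 = integer_band_row(8)      # bandwidth 500 Hz
band = integer_band_edges(2, 500.0)        # edges in ratio (m + 1)/m = 3/2
```

The printed table lists whole-Hertz values, so ratios that do not divide 4000 evenly appear there as truncated or rounded figures rather than the exact quotients this sketch would return.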

To explain how each band is related, the analysis of the sub-band
coder is discussed. The sub-band coder is designed for 9600 bits/second.
The transmitted coder parameters include the sub-band coded prediction
residual signal, PARCOR coefficients and sync bits. Table XIX represents
a breakdown of sub-band coder parameters for the high energy phonemes.
Table XX shows sub-band coder parameters relative to the low energy
sounds. Table XXI represents those parameters relative to the noise
sounds. The differences among Tables XIX, XX, and XXI are the bits
allocated, the transmission rates per band, and the sync bits.

It is well known that the decimation rate shown in Column 4 of Tables
XIX through XXI represents an integer number of samples available before
encoding. These available samples are related to the 9600 bits/second
transmission rate. The fractional representation for each frame and
sub-band samples are shown in Table XXII.




TABLE XVIII

INTEGER-BAND SAMPLING CUTOFF FREQUENCIES FOR
8000 HERTZ SAMPLING RATE

Decimation
  Ratio        f_i      2f_i      3f_i      4f_i

    1         4000      8000     12000     16000
    2         2000      4000      6000      8000
    3         1333      2666      3999      5332
    4         1000      2000      3000      4000
    5          800      1600      2400      3200
    6          666      1332      1998      2664
    7          571      1142      1713      2284
    8          500      1000      1500      2000
    9          444       888      1332      1776
   10          400       800      1200      1600
   11          363       728      1089      1452
   12          333       666       999      1332
   13          308       616       924      1232
   14          286       572       858      1144
   15          266       534       798      1064
   16          250       500       750      1000
   17          235       470       705       940
   18          222       444       666       888
   19          210       420       630       840
   20          200       400       600       800
   21          190       380       570       760
   22          182       364       546       728
   23          174       348       522       696
   24          167       334       501       668
   25          160       320       480       640
   26          154       308       462       616
   27          148       296       444       592
   28          143       286       429       572
   30          133       266       399       532
   31          129       258       387       516
   32          125       250       375       500


TABLE XXII

REPRESENTATION OF SAMPLES FOR A FRAME FOR HIGH ENERGY SOUND

Band                    Fraction/Frame      Samples/Frame

1                            .207                 53
2                            .312                 80
3                            .180                 45
4                            .207                 53
Sync and Synthesis           .094                 24

                            1.000                256 Samples/Frame

The multiplexing (see Figure 21) is simulated on the computer in four
steps. First, each of the decimated signals is extended to 256 points per
frame by appending zeros. Second, the DFT of each extended signal is
taken. Third, the transformed signals are summed. Finally, the IDFT of the
summed signal gives the multiplexed signal, which has 256 points. The
demultiplexing in Figure 21 is simulated using the inverse process. That
is, first, the decoded signal is transformed. Second, it is divided into
four frequency bands. Third, the frequency coefficients in each band are
extended with zeros to 256 points. Finally, the IDFT of each of these
signals is taken, which gives the demultiplexed signals.
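
The simulation steps above can be sketched with NumPy as follows. This is only an illustrative reading of the description, not the thesis program: the function names are hypothetical, the band edges are given as DFT bin indices chosen by the caller, and the demultiplexer is assumed to keep each band's coefficients in place while setting all other coefficients to zero.

```python
import numpy as np

FRAME = 256  # points per frame, as stated in the text


def multiplex(decimated_signals, frame=FRAME):
    """Zero-extend each signal to `frame` points, sum the DFTs, invert."""
    total = np.zeros(frame, dtype=complex)
    for s in decimated_signals:
        padded = np.zeros(frame)
        padded[: len(s)] = s           # step 1: append zeros
        total += np.fft.fft(padded)    # steps 2 and 3: DFT and sum
    return np.fft.ifft(total).real     # step 4: IDFT of the sum


def demultiplex(signal, band_edges):
    """Split the spectrum at the given rfft bin edges; one signal per band.

    Assumption: coefficients outside the band are replaced by zeros.
    """
    spectrum = np.fft.rfft(signal)
    outputs = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        banded = np.zeros_like(spectrum)
        banded[lo:hi] = spectrum[lo:hi]
        outputs.append(np.fft.irfft(banded, n=len(signal)))
    return outputs
```

With four equal-width bands on a 256-point frame, a caller might pass `band_edges=[0, 32, 64, 96, 129]`; these edges are placeholders, since the actual band edges come from Tables XIX through XXI.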

Shown in each of Tables XIX through XXII is a band labeled "Sync and
Synthesis." These parameters include synchronization bits and synthesis
parameters for the receiver. The synchronization bits include one to
establish the beginning of a frame and three to determine if the frame
contains a high, low or noise energy signal. The remaining samples in
the sync and synthesis band are allocated to the PARCOR coefficients for
synthesizing the speech.

The PARCOR coefficients are distributed over the range $|k_j| \le 1$
and, in most cases, the entire range is not required [28]. It has been
shown that the odd-ordered coefficients are somewhat skewed toward the
positive side, whereas the even-ordered coefficients are somewhat skewed
toward the negative side [28]. Limiting the quantizer range results in
better speech quality for a given number of bits assigned to each
coefficient. These parameters have been studied in depth in the
literature. Further quantization characteristics of the PARCOR
coefficients can be found in [7] [28] [31] [119]. These aspects are used in
adjusting the synthesis bits in Tables XIX to XXII, as outlined below.

Specifically, the following procedure can be used in assigning bits
for synthesis parameters. For high energy sounds, 20 bits can be utilized
for the 10 PARCOR coefficients. For low energy sounds, the bit allocation
for the PARCOR coefficients is 70. For the noise energy sounds, the bit
allocation is 170. Note that more bits are available for the PARCOR
coefficients corresponding to the low energy and the noise signals as
compared to the high energy signals. These allocations in synthesis
parameters for encoding are adequate. Actual implementation of the bit
allocations for the PARCOR coefficients and their effect on the coder has
yet to be done.


The fine tuning of quantization parameters has yet to be done. The
total sub-band system requires many trade-offs in the analysis section.
In the analysis section, allowance must be made for the transmission rate
for each sub-band. In the next section, the uniform quantization method
is discussed.


4.4  Adaptive  Uniform  Quantization 

The sub-band coder partitions the residual signal into four frequency
bands. These banded signals are passed to the quantizer for reduction
of information content. The design of the quantizer is determined
by the bits allocated as discussed earlier. The amplitude of each
residual signal sample is quantized into one of $2^{\mathrm{IBITS}}$
levels, where IBITS is the number of bits allocated for the sub-band. The
information content of the digitized signal is IBITS bits per sample. It
is shown in Column 6 in Tables XIX through XXI that the information rate
for each sub-band is

\[ \text{Information Rate} = (\text{Sampling Freq.})_n \times I \ \text{bits/second}, \qquad I = 1, \ldots, \mathrm{IBITS} \tag{4.6} \]

where $(\text{Sampling Freq.})_n$ is the sampling frequency for the nth
sub-band.

After quantization the discrete amplitude level of the signal sample
has a value expressed as a binary word of length IBITS. The value of
IBITS ranges from 1 to 5. For example, a value of 2 for IBITS yields
amplitude levels of 00, 01, 10 and 11, whereas a value of 5 would yield
32 five-bit binary words.
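
A short sketch may make the relation between IBITS, the binary words and the information rate of (4.6) concrete. The function names below are hypothetical, and any sampling frequency passed in is a placeholder rather than a value taken from Tables XIX through XXI.

```python
def codewords(ibits):
    """List the 2**ibits binary words of length ibits, in ascending order."""
    return [format(level, "0{}b".format(ibits)) for level in range(2 ** ibits)]


def information_rate(sampling_freq_hz, bits_per_sample):
    """Information rate of one sub-band in bits/second, following (4.6)."""
    return sampling_freq_hz * bits_per_sample
```

Here `codewords(2)` returns the four words 00, 01, 10 and 11, and `codewords(5)` returns 32 five-bit words.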

The range of the quantizer is aligned such that the amplitudes of the
input residual signal fall within the maximum swing of the quantizer
output levels. To ensure that no overload occurs, each frame is analyzed
before quantization; i.e., the range of the signal is found before
quantization. This is compared to the bits allocated. An adjustment is
made if needed by rounding the bits allocated to the next integer. The
method of quantization will be discussed next.

It  has  been  shown  that  a  characteristic  of  sub-band  coded  speech  is 
that  it  has  no  sample-to-sample  correlation  [36]  [37].  Following  this, 
encoding  is  best  performed  by  adaptive  pulse  code  modulation  (APCM)  [109] 
[121].  Previous  encoding  based  on  differential  or  fixed  prediction  does 
not  achieve  good  results  for  speech  using  sub-band  coders  [37].  Each 
sub-band  utilizes  a  uniform  quantizer  characteristic.  Each  sub-band 
exhibits  a  different  level  of  energy;  therefore,  an  adaptive  uniform 
quantizer  is  used  utilizing  a  technique  that  shrinks  and  expands  the 
quantizer  by  sub-band  such  that  the  signal  is  within  the  range  of  the 
maximum  quantization  level  for  that  sub-band. 

To implement the adaptive uniform quantizer, let the step size be
denoted by $\Delta$. Figure 30 illustrates the characteristic for the
adaptive uniform quantizer [109] and will be discussed in detail. It is
well known that the uniform quantizer produces an error which follows the
uniform distribution. That is, the probability density function of the
quantization error $Q_e$ is given by

\[ f(Q_e) = \frac{1}{\Delta}, \qquad -\frac{\Delta}{2} \le Q_e \le \frac{\Delta}{2} \tag{4.7} \]

with the variance

\[ \sigma^2(Q_e) = \frac{\Delta^2}{12} \tag{4.8} \]

The step size is dependent on the bits allocated. Let the number of
levels be represented by

\[ NL = 2^i, \qquad i = 1, \ldots, \mathrm{IBITS} \tag{4.9} \]

then

\[ \Delta = \frac{2}{NL} \, [e_{f_n}(n)]_{\max} \tag{4.10} \]

where $[e_{f_n}(n)]_{\max}$ is the maximum value of the nth sub-band
residual signal.

In order to achieve the quantized value, let

\[ \mathbf{C} = [C_1, C_2, \ldots, C_{NL}] \tag{4.11} \]

be an (NL)-vector used to identify the parameters of the quantizer levels
such that

\[ \mathbf{Q} = \Delta \cdot \mathbf{C} \tag{4.12} \]

where the vectors $\mathbf{C}$ and $\mathbf{Q}$ are of dimension NL and
represent the quantizer values. The entries in $\mathbf{C}$ are given by

\[ C_i = \begin{cases} -\left( \dfrac{NL}{2} - i + 1 \right), & 1 \le i \le \dfrac{NL}{2} \\ 0, & i = \dfrac{NL}{2} + 1 \\ i - \dfrac{NL}{2} - 1, & \dfrac{NL}{2} + 2 \le i \le NL \end{cases} \tag{4.13} \]

From (4.12) and (4.13), it follows that the quantized level in
$\mathbf{Q}$ is given by

\[ Q_i = \Delta \cdot C_i \tag{4.14} \]

The quantized values of the residual signal are obtained by rounding it
to the nearest quantized level, which is used to code the signal.
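
The quantization procedure can be illustrated with a small NumPy sketch. It is written under stated assumptions and is not the thesis routine QUNTIZ: the peak of the frame is taken as the largest absolute sample, the step size is 2/NL times that peak as in (4.10), and the level indices are assumed to run from -NL/2 through NL/2 - 1. The function names are hypothetical.

```python
import numpy as np


def quantizer_levels(ibits, peak):
    """Return (step, levels) for a uniform quantizer with NL = 2**ibits levels."""
    nl = 2 ** ibits
    step = 2.0 * peak / nl                  # step size scaled to the frame peak
    indices = np.arange(-nl // 2, nl // 2)  # assumed: -NL/2, ..., 0, ..., NL/2 - 1
    return step, step * indices


def quantize(frame, ibits):
    """Round each sample to the nearest level; return (codes, levels)."""
    frame = np.asarray(frame, dtype=float)
    peak = np.max(np.abs(frame))            # range found before quantization
    if peak == 0.0:
        return np.zeros(len(frame), dtype=int), np.zeros(2 ** ibits)
    _, levels = quantizer_levels(ibits, peak)
    # Index of the nearest level for every sample.
    codes = np.argmin(np.abs(frame[:, None] - levels[None, :]), axis=1)
    return codes, levels
```

The decoded frame is then `levels[codes]`. Because the step size is recomputed from each frame's peak, the quantizer shrinks and expands with the signal level of the sub-band.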

In the next section, performance measures are discussed for the
quantizer and the sub-band coder.

4.5 Signal-to-Noise Ratio
Performance Measurements

In the previous section, the quantization is done for the banded
prediction residual. In this section, performance measurements will be
discussed. It has been recognized in the literature that signal-to-noise
ratio (SNR) is an inadequate performance measure for speech coding [109].
This inadequacy is related to the idea that additive white noise is not a
good model for error waveforms in speech quantization. Generally, most
authors supplement the SNR by subjective and perceptual measurements.

The SNR is still the single most informative measure for quantizer
performance [109]. If the quantizer is designed for maximum SNR, the
step size can be chosen according to the probability density function of
the signal [122]. However, the SNR improvement is offset by greater idle
channel noise for speech [123]. The result is poorer subjective
performance [123]. Therefore, to enhance SNR an adaptive quantizing
technique is used based on the allocation of bits.

It has been shown that transform coding with adaptive quantizers
maximizes SNR and lowers the idle channel noise [81]. Intuitively,
sub-band coding should follow under similar conditions. With sub-band
coding, the quantization noise of each band is contained within that band,
which therefore minimizes the quantization noise of the coded speech [36].
Due to the characteristics of the speech spectrum, the quantization
distortion is not equally detectable at all frequencies. This technique
offers a means of controlling the quantization noise across the speech
spectrum and, therefore, of realizing an improvement in signal quality
[36].

The definition of each objective measure will be discussed next.
Perhaps the most common measurement of performance is the conventional
(normalized) SNR which is defined as

\[ \mathrm{NSNR} = -10 \log_{10} \left[ \frac{\sum_{k=0}^{N-1} (x(k) - y(k))^2}{\sum_{k=0}^{N-1} x^2(k)} \right] \tag{4.15} \]

where x(k) is the input to the coder and y(k) is the output of the
decoder. It is assumed that the numerator represents the noise of the
coding technique, such that as the noise decreases, the ratio of the
summations in (4.15) becomes smaller and the resulting NSNR, in decibels,
becomes larger. The advantage of this quantity is that it normalizes the
error between the coder input and the decoder output by the energy of the
input. For speech there is no perceptual advantage in maximizing the SNR;
however, the SNR in (4.15) could be optimized for the autocorrelation of
the speech [122].

Another measure similar to (4.15) is the root-mean-square error
which is defined as

\[ \mathrm{RMSSNR} = -20 \log_{10} \left[ \frac{\sum_{n=0}^{N-1} (x(n) - y(n))^2}{N} \right]^{1/2} \tag{4.16} \]

where x(n) and y(n) are defined as before. In (4.16), the error is
assumed to be of random nature, and is normalized by the factor N, the
number of data points.

A third measure is defined as

\[ \mathrm{MSSNR} = -\frac{1}{N} \sum_{n=0}^{N-1} 10 \log_{10} \left[ \frac{(x(n) - y(n))^2}{x^2(n)} \right] \tag{4.17} \]

where x(n) and y(n) are expressed as before. The representation in (4.17)
provides a further measure of the error.
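
The three measures can be computed as in the following NumPy sketch. It is not the SNRCAL program of Appendix B; it assumes the sign conventions under which all three quantities grow as the error shrinks, takes the square root inside the logarithm for the RMS measure, and adds a small constant (a hypothetical guard, `eps`) to avoid taking the logarithm of zero in the per-sample measure.

```python
import numpy as np


def nsnr(x, y):
    """Normalized SNR in dB: error energy relative to input energy."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.sum((x - y) ** 2) / np.sum(x ** 2))


def rmssnr(x, y):
    """RMS measure in dB: -20 log10 of the root-mean-square error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return -20.0 * np.log10(np.sqrt(np.mean((x - y) ** 2)))


def mssnr(x, y, eps=1e-12):
    """Average of the per-sample error-to-signal ratios, expressed in dB."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ratio = ((x - y) ** 2 + eps) / (x ** 2 + eps)  # eps guards against log(0)
    return -np.mean(10.0 * np.log10(ratio))
```

As a check, if the decoder output is 0.9 times the input, the error is one tenth of the signal at every sample, and both `nsnr` and `mssnr` evaluate to approximately 20 dB.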

The results using (4.15), (4.16) and (4.17) are shown in Table XXIII.
These are computed by program SNRCAL (see Appendix B). These results
exemplify good coder performance. Note that these simulations are done
without bit assignment to PARCOR coefficients. Several phonemes are used
in these measurements and they give an adequate measure of the coder.
However, the complete simulation should include quantization of all
parameters to complete the 9600 bits/second coding algorithm. The next
section discusses the computational aspects for coding and decoding the
prediction residual.


TABLE XXIII

SIGNAL-TO-NOISE PERFORMANCE MEASUREMENT
FOR SEVERAL PHONEMES

Phoneme      RMSSNR      NSNR      MSSNR
-----------------------------------------
/I/           29.2       36.7       18.2
It/           37.2       36.9       19.1
/ae/          35.1       37.4       17.5
IM            32.9       34.9       15.2
/a/           30.1       38.4       18.3
M             36.8       38.7       17.7
Ifl           29.8       38.0       18.4
/aI/          31.3       37.0       18.2
/aU/          34.2       38.4       18.7
/OU/          29.4       37.9       16.8
/eI/          33.4       39.0       17.0


4.6  Computation  for  Coding  the 
Prediction  Residual 

The flow chart that gives all the computer modules is given in
Figure 31 for coding the residual signal. The data blocks shown in Table
XXIV represent data processed and online storage during the computations.


TABLE XXIV

DATA BLOCKS FOR PROCESSING AND STORAGE

Data Block       Record    Number of
Name             Length    Records      Module Used
----------------------------------------------------------------------
BURGE.DAT          256        82        DIGITIZ/WINDOW
WINDOW.DAT         256        16        WINDOW/AUTO/LATTIC/INVERS
AUTO.DAT           176        16        AUTO/LATTIC/INVERS
RESIDUAL.DAT       256        16        INVERS/LATTIC/FFTMGR/SUMLPD
SPECTM.DAT         256        16        FFTMGR/RESULT/(PITCH)
BITS.DAT            16        16        FFTMGR/SUMLPD/ENCODE
PHAZ.DAT           256        16        FFTMGR/RESULT
CODE.DAT           256        16        ENCODE/DECODE
SIGNAL.DAT         256        16        DECODE/RESULT
SQNR.DAT           256        16        RESULT
SBAND1.DAT         256        16        SUMLPD/ENCODE
SBAND2.DAT         256        16        SUMLPD/ENCODE
SBAND3.DAT         256        16        SUMLPD/ENCODE
SBAND4.DAT         256        16        SUMLPD/ENCODE

The  modules  are  arranged  to  generate  and  use  the  data  in  Table  XXIV 
on  the  INTERDATA  70.  The  tape  recorder  inputs  an  analog  signal  to  the 
computer  while  DIGITIZ  computes  a  sampled  signal  and  stores  the  digitized 
signal  on  disk  in  location  BURGE.DAT.  DIGITIZ  is  set  up  to  store  4096 
points.  This  program  calls  an  assembly  language  digitizer  and  sequence 
clock  for  sampling.  This  program  is  flexible  for  sampling  any  analog 
signal  and  storing  the  signal  on  disk. 

Program WINDOW uses the data on disk, BURGE.DAT. The data is
windowed using a 256-point Hamming window. The user has the option of
selecting which record of the digitized data to window. The program
reports the sequence selected and also scales the data. The window data
is written in data block WINDOW.DAT.
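
The windowing step can be sketched as follows. This is an illustration only, not the WINDOW program: the function name is hypothetical, the record is selected by a zero-based index, and the scaling is assumed to normalize the record to unit peak, since the text does not state how the data is scaled.

```python
import numpy as np


def window_record(samples, record_index, record_length=256):
    """Select one record of the digitized data and apply a Hamming window."""
    start = record_index * record_length
    record = np.asarray(samples[start:start + record_length], dtype=float)
    if len(record) != record_length:
        raise ValueError("record_index is outside the digitized data")
    peak = np.max(np.abs(record))
    if peak > 0.0:
        record = record / peak                # assumed scaling: unit peak
    return record * np.hamming(record_length)
```

For a 4096-point digitized signal, `record_index` may range from 0 through 15.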

Routine AUTO calculates predictor and PARCOR coefficients using
Levinson's method [61]. The program uses as input the window data,
WINDOW.DAT. The output is an array, AUTO.DAT, containing autocorrelation
coefficients, predictor coefficients, cross-correlation coefficients and
reflection coefficients.
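
A textbook form of the Levinson recursion is sketched below for orientation. It is not the AUTO routine itself; the function names are hypothetical, the inverse filter is assumed to be written as $A(z) = 1 + \sum_i a_i z^{-i}$, and the sign convention of the reflection (PARCOR) coefficients varies among authors, so the signs here may differ from those used in the thesis.

```python
import numpy as np


def autocorrelation(frame, order):
    """Autocorrelation lags r[0..order] of a (windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[: len(frame) - lag], frame[lag:])
                     for lag in range(order + 1)])


def levinson_durbin(r, order):
    """Return (a, k, err): inverse-filter coefficients a[0..order] with
    a[0] = 1, reflection coefficients k[0..order-1], and the final
    prediction error energy."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for m in range(1, order + 1):
        # Correlation of the current filter with the next lag.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        km = -acc / err
        k[m - 1] = km
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + km * a_prev[m - i]   # order-update of the filter
        a[m] = km
        err *= (1.0 - km * km)                       # error energy shrinks
    return a, k, err
```

For the tenth-order analysis described in the text, a caller would use `r = autocorrelation(frame, 10)` followed by `levinson_durbin(r, 10)`.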

Routine INVERS uses the data from AUTO, AUTO.DAT, for use in the
lattice filter implementation from Equations (2.31) and (2.34). The order
of the filter is ten. The output from this program is the set of residual
values. This output is stored in RESIDUAL.DAT. Routine LATTIC is the
same as INVERS except that LATTIC gives the user the option to produce
a plot of the speech and prediction residual on CALCOMP.
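
A lattice inverse filter of this kind can be sketched as below. The recursion shown is the standard analysis lattice and is assumed, not confirmed, to correspond to Equations (2.31) and (2.34); the function name is hypothetical and the sign convention of the reflection coefficients is the one in which a single coefficient k gives the residual x(n) + k x(n-1).

```python
import numpy as np


def lattice_inverse_filter(x, k):
    """Pass x through an analysis lattice with reflection coefficients k.

    Returns the forward prediction error (the residual) of order len(k).
    """
    x = np.asarray(x, dtype=float)
    order = len(k)
    b_delayed = np.zeros(order)       # backward errors from the previous sample
    residual = np.empty_like(x)
    for n, sample in enumerate(x):
        f = sample                    # forward error of order 0
        b = sample                    # backward error of order 0
        for m in range(order):
            f_next = f + k[m] * b_delayed[m]
            b_next = b_delayed[m] + k[m] * f
            b_delayed[m] = b          # keep this stage's input for the next sample
            f, b = f_next, b_next
        residual[n] = f
    return residual
```

With ten reflection coefficients this gives the tenth-order residual; the filter starts from zero initial conditions at the beginning of each call.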

The FFTMGR module is an FFT manager that includes a bit reversal and
unscrambler. The input to this program is the prediction residual,
RESIDUAL.DAT. This routine calculates the average spectrum, magnitude
square and the energy of the prediction residual. It calculates the
energy per sub-band. It uses an a priori estimate of the energy and bits
to calculate the normalization factor and bits for each sub-band. The
program writes on disk the spectrum, SPECTM.DAT, the bits allocated,
BITS.DAT, and the phase, PHAZ.DAT. It also gives the user the option for
a plot of the spectrum on CALCOMP.
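
The energy-per-sub-band computation can be illustrated with the sketch below. It is not the FFTMGR module: the function name is hypothetical, the energy of a band is taken as the sum of the squared DFT magnitudes whose frequencies fall in the band, and the band edges are supplied by the caller because the actual edges are defined in Tables XIX through XXI.

```python
import numpy as np


def subband_energies(residual, band_edges_hz, sampling_rate_hz=8000.0):
    """Sum the squared spectrum magnitude inside each (low, high) band."""
    residual = np.asarray(residual, dtype=float)
    power = np.abs(np.fft.rfft(residual)) ** 2
    freqs = np.fft.rfftfreq(len(residual), d=1.0 / sampling_rate_hz)
    energies = []
    for low, high in band_edges_hz:
        in_band = (freqs >= low) & (freqs < high)   # upper edge is excluded
        energies.append(float(np.sum(power[in_band])))
    return energies
```

A call such as `subband_energies(frame, [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)])` uses placeholder edges and returns one energy value per band for a 256-point frame.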

Routine SUMLPD passes the prediction residual through a digital
bandpass filter. The signal is modulated, lowpass filtered and decimated
as shown in Figure 21. The input to this program is the data file
RESIDUAL.DAT. The outputs are the four sub-bands, SBAND1.DAT, SBAND2.DAT,
SBAND3.DAT, and SBAND4.DAT.

The signals corresponding to the four sub-bands are encoded using the
bits allocated in BITS.DAT by using the program ENCODE. ENCODE allows
for 32 levels of code. In case of non-integer numbers, the quantizer,
QUNTIZ, rounds the bits to determine the number of quantizable levels.
A uniform quantization is used to determine the code. The output is
written in CODE.DAT.

Routine DECODE uses CODE.DAT as input. In the initial frame, the
maximum number of quantization levels is determined. This maximum sets
the level for the inverse quantizer. Then the signal is decoded and
written in file SIGNAL.DAT.

The program RESULT interpolates, modulates and bandpasses the signal,
SIGNAL.DAT, for reconstruction. The routine calculates the signal-to-noise
ratios given by (4.15) and (4.17).

4.7  Summary 

In this chapter, the energy based sub-band coding algorithm was
presented. The method of allocation of bits was discussed. The design of
the sub-band encoding of the prediction residual was presented. The
computational aspects for coding the prediction residual were discussed.


CHAPTER  V 


SUMMARY  AND  SUGGESTIONS  FOR  FURTHER  STUDY 
5.1  Summary 

This thesis investigates an efficient coding of the prediction
residual using the technique of sub-band coding at the bit rate of 9600
bits/second. The energy of the prediction residual is used to distribute
the bit allocation by sub-bands such that perceptual criteria are
preserved. The perceptual criteria are enhanced by transition information
embedded in the phoneme connections of speech by a technique that weights
the energy based on a normalization factor.

Each sub-band is partitioned such that there is an equitable
contribution to the Articulation Index as it is a measure of speech
intelligibility. This is discussed in relation to the quality of speech.
The perception of speech is described in a qualitative sense. The
relationship between the Articulation Index and transitional information
is described as a method of discrimination of speech sounds.

The prediction residual is discussed as a parallel to the glottal
waveform. The prediction residual is formed by passing speech through an
inverse filter. This is represented as a deconvolution of speech from the
vocal tract filter.

The vocal tract filter is modeled as a recursive digital filter
using the method of linear prediction. Linear prediction produces the
prediction residual, which is the difference between the actual and
predicted speech signals. Because the prediction residual is parallel to
glottal excitation, the prediction residual is an ideal input for pitch
extraction.

A novel pitch extraction technique is presented. It is a two-stage
method that estimates the residual spectrum and uses time samples of the
residual to calculate the approximation of the pitch. The technique
calculates a threshold which uses squared samples to extract the pitch
within a frame. It also includes an error check that detects wide
variations of the pitch within each period, after which the estimate is
updated.

The three-tier classification of phonemes is derived from the energy
study of the phonemes for the prediction residual. It is shown that the
energy of the prediction residual divides the phonemes into classes by
phonemic aggregations, namely high energy, low energy and noise groups.
The high energy group includes the vowels and diphthongs. The plosive,
fricative and unvoiced phonemes compose the noise group. The low energy
group is composed of glides and nasals.

The three-tier classification of the energy levels along with the
four frequency bands allows for efficient allocation of bits per sample
for each band. The above method aids in preserving perceptual criteria
and preserves pitch-formant data by the allocation of a large number of
bits per sample in the lower bands. Since fricative and noisy sounds are
predominant in the upper bands, a smaller number of bits is used in the
lower bands. The perceptual criteria are further enhanced by a
normalization factor.

The normalization factor is perceptual in nature and is used as a
weighting factor for transitional cueing. The derivation of the
normalization factor is discussed. Additional variations are given for
the relationship of the three phonemic classes to the normalization
factor.

The sub-band coder is designed based on the normalization factor,
the energy data, and the bit allocation. The parameters are computed on
a frame-by-frame basis. The sub-bands are constructed such that the bit
rate of the data from each band can be synchronized when multiplexed at
9600 bits/second. The integer-band sampling scheme is analyzed at the
sampling rate of 8000 Hertz for a 9600 bits/second transmission rate.
The sub-band coder is designed to transmit the coded prediction residual
signal, synthesis parameters and sync bits at the 9600 bits/second rate.

An integral part of the sub-band coder is the quantizer. The
encoding of the signal is designed based on adaptive pulse code modulation.
Uniform quantization is used. The characteristics of the quantizer are
discussed in detail. Performance of the quantizer is described in terms
of signal-to-noise ratios (SNR) as an objective criterion for quality. The
conventional (normalized) SNR is used for representing the error between
the coder input and the decoder output. The mean-square SNR is used for an
indication of gross error. These SNR measurements are only an indication
of quantizer performance. Generally, the SNR must be supplemented by
subjective and perceptual measurement. However, the SNR measurements in
this thesis are used without listeners.

In the following, some extensions to the present effort are
suggested. Appropriate references are indicated.

5.2 Suggestions for Further Study

5.2.1  PARCOR  Coefficient  Study  of  Sensitivity 

The PARCOR coefficients introduced in Chapter II have been thoroughly
investigated because of their importance to speech analysis and synthesis
[9] [28] [31]. The priority is geared toward the synthesis of speech, in
that, given the prediction residual and PARCOR coefficients, the speech
signal can be adequately regenerated. An extension of the present work
would enhance present efforts in this area by studying the sensitivity of
PARCOR coefficients with respect to the sub-band coding of the prediction
residual.

5.2.2  Sub-Band  Coding  Using  Subjective 

Measurements 

The present work can be further advanced by sub-band coding the
prediction residual at various bit rates. The synthesized signal would
then be used in a comparative study for various bit rates. The perceptual
question concerning the method should be geared towards a recording of
the synthesized speech so that a set of listeners could hear the results.

5.2.3  Energy  Threshold  Matrix  Study 

The introduction of the energy threshold matrix (ETM) in Chapter III
requires further study. In this work it is seen that the ETM is highly
dependent on perceptual criteria; consequently, several variations would
benefit the present work. In some instances, it is necessary to bias the
energy group to enhance the perceptual aspects; but this is unknown until
the energy distribution is computed. The results of the ETM are dependent
on transmission rates; however, given one transmission rate, several ETMs
may be equally applicable to the coding.

5.2.4  Integer-Band  Coding  of  the  Prediction 

Residual 

The  integer-band  coding  method  introduced  in  Chapter  IV  for  use  with 
the  prediction  residual  has  not  been  considered  in  this  thesis.  It  is 
simple  to  implement  and  would  minimize  the  need  for  modulators.  Previous 
authors  have  studied  this  for  speech;  however,  the  subject  has  not  been 
studied  for  the  prediction  residual  [36]  [37]. 

5.2.5  Prediction  Residual  and  Noise 

A study that would greatly benefit the speech coding area is to mask
the prediction residual with white noise. That is,

\[ z(k) = e_f(k) + v(k) \]

where $e_f(k)$ represents the discrete samples of the prediction residual
signal and $v(k)$ represents the discrete samples of the white noise.
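
An experiment of this kind might begin with a routine such as the following sketch, which adds white Gaussian noise to a residual frame at a chosen signal-to-noise ratio. The function name, the Gaussian noise model and the fixed random seed are assumptions made for illustration; the thesis does not prescribe them.

```python
import numpy as np


def add_white_noise(residual, snr_db, seed=0):
    """Return (z, v) with z(k) = e_f(k) + v(k) at the requested SNR in dB."""
    rng = np.random.default_rng(seed)
    residual = np.asarray(residual, dtype=float)
    noise = rng.standard_normal(len(residual))
    signal_power = np.mean(residual ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that signal_power / mean(v**2) equals 10**(snr_db / 10).
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    v = gain * noise
    return residual + v, v
```

The noisy signal z(k) can then be passed to the pitch extractor or the coder, and the results compared with those obtained from the clean residual.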

The enhancement of the pitch period markings would be of major
importance in this study. Further, the synthesized signal-to-noise ratio
performance measurements would also be of interest. The speech waveform
has been examined in noise stripping environments; however, the prediction
residual in a noise environment has results that are promising [19] [58].
An aid to characterization of the signal would be to use the Laplacian or
Gamma distribution, as with the speech. However, these distributions are
questionable for the prediction residual since the waveform is different.
Determining the probability distribution of the prediction residual may
be a study in itself.

5.2.6 Modeling the Prediction Residual

The prediction residual in this thesis is obtained by inverse
filtering the speech signal. Under certain conditions, it is not easy to
code the inverse filter; however, if a model were determined that is
similar to the signal, it would be of benefit for synthesis. An extension
of the work in Chapter II would be to compare the speech and the prediction
residual. It would be necessary to identify the essential parameters
that can be derived from the residual signal, such as pitch, phase in f,
formant characteristic and noise between pitch period pulses. The end
results would approximate an expression that compares with the actual
residual pulse. This in turn could be compared with Flanagan and
Rosenberg's work [2] [12] [32].


BIBLIOGRAPHY 


(1)  Delattre,  P.  C.,  A.  M.  Liberman,  and  F.  S.  Cooper.  "Acoustic  Loci 

and  Transitional  Cues  for  Consonants."  The  Journal  of  the 
Acoustical  Society  of  America,  Vol.  27,  No.  4  (1955),  769-773. 

(2)  Flanagan,  J.  L.  Speech  Analysis  Synthesis  and  Perception.  New  York: 

Springer-Verlag,  1972. 

(3)  Dudley,  H.  "Remaking  Speech."  The  Journal  of  the  Acoustical  Society 

of  America,  Vol.  11  (1939),  165. 

(4)  Dudley,  H.,  R.  R.  Riesz,  and  S.  S.  A.  Watkins.  "A  Synthetic  Speak¬ 

er."  Journal  of  the  Franklin  Institute,  Vol.  227,  No.  6  (1939), 
739-76?: 

(5)  Atal,  B.  S.,  and  S.  L.  Hanauer.  "Speech  Analysis  and  Synthesis  by 

Linear  Prediction  of  the  Speech  Wave."  The  Journal  of  the 
Acoustical  Society  of  America,  Vol.  50,  No.  2  (Part  7)  TT971 ), 


(6)  Atal,  B.  S.,  and  M.  R.  Schroeder.  "Adaptive  Predictive  Coding  of 

Speech  Signals."  Bell  System  Technical  Journal  (1970),  1973- 
1990. 

(7)  Markel ,  J.  D.,  and  A.  H.  Gray,  Jr.  Linear  Prediction  of  Speech. 

New  York:  Springer-Verlag,  1976. 

(8)  Markel,  J.  0.  "Digital  Inverse  Filtering  -  A  New  Tool  for  Formant 

Trajectory  Estimation. "  IEEE  Transactions  on  Audio  and  Elec¬ 
troacoustics,  Vol.  AU-20,  No.  2  (1972),  1 29^X37. 

(9)  Itakura,  F.,  and  S.  Saito.  "A  Statistical  Method  for  Estimation  of 

Speech  Spectral  Density  and  Formant  Frequencies."  Electronics 
and  Communications  in  Japan,  Vol  53-A,  No.  1  (1970),  36-43. 

(10)  Un,  C.  K.,  and  D.  T.  Magill.  "The  Residual -Excited  Linear  Prediction 

Vocoder  with  Transmission  Rate  Below  9.6  Kbits/s."  IEEE  Trans- 
actions  on  Communications,  Vol.  C0M-23,  No.  12  (1975),  1466^ 
1474. 

(11)  Markel,  J.  D.,  A.  H.  Gray,  Jr.,  and  H.  Wakita.  "Linear  Prediction  of 

Speech  -  Theory  and  Practice."  Santa  Barbara,  California: 

Speech  Communications  Research  Laboratory,  Inc.,  SCRL  Monograph 
No.  10,  1973. 


149 


150 


(12) Rosenberg, A. E. "Effect of Glottal Pulse Shape on the Quality of
Natural Vowels." The Journal of the Acoustical Society of
America, Vol. 49, No. 2, Part 2 (1971), 583-590.

(13) Dunn, J. G. "An Experimental 9600-Bits/s Voice Digitizer Employing
Adaptive Prediction." IEEE Transactions on Communication
Technology, Vol. COM-19, No. 6 (1971), 1021-1032.

(14) Gibson, J. D., S. K. Jones, and J. L. Melsa. "Sequential Adaptive
Prediction and Coding of Speech Signals." IEEE Transactions
on Communications, Vol. COM-22, No. 11 (1974), 1789-1797.

(15) Cohn, D. L., and J. L. Melsa. "The Residual Encoder - An Improved
ADPCM System for Speech Digitization." International
Communications Control Conference Record (1975), 30-26 to 30-30.

(16) Goldberg, A. J., A. Arcese, T. McAndres, R. Cheung, and R. Freudberg.
Kalman Predictive Encoder. Needham Heights, Massachusetts:
GTE Sylvania, Report to Defense Communications Engineering
Center, Contract No. DCA 100-74-C-0058, 1975.

(17) McDonald, R. A. "Signal-to-Noise and Idle Channel Performance of
Differential Pulse Code Modulation Systems - Particular
Applications to Voice Signals." Bell System Technical Journal
(1966), 1123-1150.

(18) Qureshi, S. U. H., and G. D. Forney. "Adaptive Residual Coder - An
Experimental 9.6/16 KB/S Speech Digitizer." EASCON 1975 Record
(1975), 29A-29E.

(19) Berouti, M., and J. Makhoul. "High Quality Adaptive Predictive
Coding of Speech." International Conference on Acoustic, Speech
and Signal Processing Record (1978), 303-305.

(20) Esteban, D., C. Galand, D. Manduit, and J. Menez. "9.6/7.2 KBPS
Voice Excited Predictive Coder (VEPC)." International Conference
on Acoustics, Speech and Signal Processing Record (1978),
307-311.

(21) Melsa, J. L., D. L. Cohn, J. D. Gibson, R. Kolstad, D. Kopetzky, G.
Lauer, and J. Tomkic. Study of Sequential Estimation Method
for Speech Digitization. Notre Dame, Indiana: University of
Notre Dame, Report to Defense Communications Engineering Center,
Contract No. DCA 100-74-C-0037, June 16, 1975.

(22) Atal, B. S., and M. R. Schroeder. "Predictive Coding of Speech
Signals and Subjective Error Criteria." IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 3
(1979), 247-254.

(23) Cohn, D. L., and J. L. Melsa. "A New Configuration for Speech
Digitization at 9600 Bits Per Second." International Conference on
Acoustic, Speech and Signal Processing Record (1979), 550-553.




(24) Chang, C. S. "An Improved Residual Encoder for Speech Compression."
International Conference on Acoustic, Speech and Signal
Processing Record (1979), 542-545.

(25) Magill, D. T., C. K. Un, and S. E. Cannon. Speech Digitization
Excitation Study. Menlo Park, California: Stanford Research
Institute, Report to Defense Communication Engineering Center,
SRI Project 1526-8.

(26) Dankberg, M. D., and D. Y. Wong. "Development of a 4.8-9.6 Kbps
RELP Vocoder." International Conference on Acoustic, Speech
and Signal Processing Record (1979), 554-557.

(27) Viswanathan, R., W. Russell, and J. Makhoul. "Voice-Excited LPC
Coders for 9.6 KBPS Speech Transmission." International
Conference on Acoustics, Speech and Signal Processing Record
(1979), 558-561.

(28) Kang, G. S. Application of Linear Prediction Encoding to a
Narrowband Voice Digitizer. Washington, D. C.: Naval Research
Laboratory, NRL Report 7774, 1974.

(29) Kryter, K. "Methods for the Calculation and Use of the Articulation
Index." The Journal of the Acoustical Society of America,
Vol. 34, No. 11 (1962), 1689-1697.

(30) Itakura, F., and S. Saito. "Analysis Synthesis Telephone Based on
the Maximum Likelihood Method." The Sixth International
Congress on Acoustics Record (1968), C17-C20.

(31) Makhoul, J. "Stable and Efficient Lattice Methods for Linear
Prediction." IEEE Transactions on Acoustics, Speech and Signal
Processing, Vol. ASSP-25, No. 5 (1977), 423-428.

(32) Flanagan, J. L. "Some Properties of the Glottal Sound Source."
Journal of Speech Hearing Research, Vol. 1 (1958), 99-116.

(33) Rabiner, L. R., B. S. Atal, and M. R. Sambur. "LPC Prediction
Error - Analysis of its Variation with the Position of the
Analysis Frame." IEEE Transactions on Acoustics, Speech and
Signal Processing, Vol. ASSP-25, No. 5 (1977), 434-442.

(34) Goodman, L. M. "Channel Encoders." Proceedings of the IEEE (1967),
127-128.

(35) White, C. E. "Bits of Voice." Telecommunications, Vol. 12, No. 4
(1978), 46-48.

(36) Crochiere, R. E., S. A. Webber, and J. L. Flanagan. "Digital
Coding of Speech in Sub-Bands." Bell System Technical Journal,
Vol. 55, No. 8 (1976), 1069-1085.




(37) Crochiere, R. E. "On the Design of Sub-Band Coders for Low-Bit-
Rate Speech Communication." Bell System Technical Journal,
Vol. 56, No. 5 (1977), 747-770.

(38) Tribolet, J. M., P. Noll, B. J. McDermott, and R. E. Crochiere.
"A Study of Complexity and Quality of Speech Waveform Coders."
International Conference on Acoustics, Speech and Signal
Processing Record (1978), 536-590.

(39) Barabell, A. J., and R. E. Crochiere. "Sub-Band Coder Design
Incorporating Quadrature Filters and Pitch Prediction."
International Conference on Acoustics, Speech and Signal Processing
Record (1979), 530-533.

(40) Crochiere, R. E. "A Novel Approach for Implementing Pitch
Prediction in Sub-Band Coding." International Conference on
Acoustics, Speech and Signal Processing Record (1979), 526-529.

(41) Mathews, M. J., J. E. Miller, and E. E. David, Jr. "Pitch
Synchronous Analysis of Voiced Sounds." The Journal of the Acoustical
Society of America, Vol. 33, No. 2 (1961), 179-186.

(42) Pinson, E. N. "Pitch-Synchronous Time-Domain Estimation of Formant
Frequencies and Bandwidths." The Journal of the Acoustical
Society of America, Vol. 25, No. 8 (1963), 1263-1273.

(43) Sondhi, M. M. "New Methods of Pitch Extraction." IEEE Transactions
on Audio and Electroacoustics, Vol. AU-16 (1968), 262-266.

(44) Dubnowski, J. J., R. W. Schafer, and L. R. Rabiner. "Real-Time
Digital Hardware Pitch Detector." IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. ASSP-24 (1976), 2-8.

(45) Noll, A. M. "Cepstrum Pitch Determination." The Journal of the
Acoustical Society of America, Vol. 41 (1967), 293-309.

(46) Schafer, R. W., and L. R. Rabiner. "System for Automatic Formant
Analysis of Voiced Speech." The Journal of the Acoustical
Society of America, Vol. 47, No. 2, Part 2 (1970), 634-648.

(47) Markel, J. D. "The SIFT Algorithm for Fundamental Frequency
Estimation." IEEE Transactions on Audio and Electroacoustics,
Vol. AU-20 (1972), 367-377.

(48) Miller, N. J. "Pitch Detection by Data Reduction." IEEE
Transactions on Audio and Electroacoustics, Vol. ASSP-23 (1975),
72-79.

(49) Gold, B. "Computer Program for Pitch Extraction." The Journal of
the Acoustical Society of America, Vol. 34, No. 7 (1962),
916-921.




(50) Gold, B., and L. R. Rabiner. "Parallel Processing Techniques for
Estimating Pitch Periods of Speech in the Time Domain." The
Journal of the Acoustical Society of America, Vol. 46 (1969),
442-448.

(51) Rabiner, L. R., M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal.
"A Comparative Performance Study of Several Pitch Detection
Algorithms." IEEE Transactions on Acoustics, Speech and Signal
Processing, Vol. ASSP-24, No. 5 (1976), 399-418.

(52) Ross, M. J., H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley.
"Average Magnitude Difference Function Pitch Extractor."
IEEE Transactions on Acoustics, Speech and Signal Processing,
Vol. ASSP-22 (1974), 353-362.

(53) Maksym, J. N. "Real-Time Pitch Extraction by Adaptive Prediction
of the Speech Waveform." IEEE Transactions on Audio and
Electroacoustics, Vol. AU-21, No. 3 (1973), 149-153.

(54) McGonegal, C. A., L. R. Rabiner, and A. E. Rosenberg. "A Semi-
Automatic Pitch Detector (SAPD)." IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. ASSP-23 (1975),
570-574.

(55) Wise, J. D., J. R. Caprio, and T. N. Parks. "Maximum Likelihood
Pitch Estimation." IEEE Transactions on Acoustics, Speech and
Signal Processing, Vol. ASSP-24, No. 5 (1976), 418-423.

(56) Markel, J. D. "Application of a Digital Inverse Filter for
Automatic Formant and F0 Analysis." IEEE Transactions on Audio and
Electroacoustics, Vol. AU-21, No. 3 (1973), 154-160.

(57) Itakura, F., and S. Saito. "On the Optimum Quantization of Feature
Parameters in the PARCOR Speech Synthesizer." International
Conference on Speech Communication and Processing Record
(1972), 434-437.

(58) Boll, S. F. "A Priori Digital Speech Analysis." (Unpub. Ph.D.
Dissertation, University of Utah, 1973).

(59) Barnwell, T. D., J. E. Brown, A. J. Bush, and C. R. Patisaul. Pitch
and Voicing in Speech Digitization. Atlanta, Georgia: Georgia
Institute of Technology, Research Report No. 3-21-620-74BU-1,
1974.

(60) Makhoul, J. "Linear Prediction: A Tutorial Review." Proceedings
of the IEEE, Vol. 62, No. 4 (1975), 561-580.

(61) Levinson, N. "The Wiener RMS (Root Mean Square) Error Criterion in
Filter Design and Prediction." Journal of Mathematics Physics,
Vol. 25 (1947), 261-278.

(62) Robinson, E. A., and S. Treitel. "Principles of Digital Wiener
Filtering." Geophysics Prospectus, Vol. 15 (1967), 311-333.




(63) Shannon, C. E. "A Mathematical Theory of Communication." Bell
System Technical Journal, Vol. 27, No. 3 (July 1948), 379-423, and
(October 1948), 623-656.

(64) Flanagan, J. L., M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S.
Jayant, and J. M. Tribolet. "Speech Coding." IEEE Transactions
on Communications, Vol. COM-27, No. 4 (1979), 710-737.

(65) Tobias, J. V., ed. Foundations of Modern Auditory Theory, Vol. II.
New York: Academic Press, 1972.

(66) Dew, D., and P. J. Jensen. Phonetic Processing - The Dynamics of
Speech. Ohio: Charles E. Merrill Publishing Company, 1977.

(67) Rabiner, L. R., and R. W. Schafer. Digital Processing of Speech
Signals. New Jersey: Prentice-Hall, 1978.

(68) Jakobson, R., C. G. M. Fant, and M. Halle. Preliminaries to Speech
Analysis - The Distinctive Features and Their Correlates.
Cambridge: MIT Press, 1969.

(69) Fant, G. Acoustic Theory of Speech Production. The Hague, The
Netherlands: Mouton, 1960.

(70) Jayant, N. S., ed. Waveform Quantization and Coding. New York:
IEEE Press, 1976.

(71) Grenander, U., and G. Szego. Toeplitz Forms and Their Applications.
Berkeley, California: University of California Press, 1958.

(72) Durbin, J. "Efficient Estimation of Parameters in Moving-Average
Models." Biometrika, Vol. 46, Parts 1 and 2 (1959), 306-316.

(73) Durbin, J. "The Fitting of Time-Series Models." Review of
Institution of International Statistics, Vol. 28, No. 3 (1960),
233-244.

(74) McGonegal, C. A., L. R. Rabiner, and A. E. Rosenberg. "A Subjective
Evaluation of Pitch Detection Methods Using LPC Synthesized
Speech." IEEE Transactions on Acoustics, Speech and Signal
Processing, Vol. ASSP-25, No. 3 (1977), 221-229.

(75) Wightman, F. L., and D. M. Green. "The Perception of Pitch."
American Scientist, Vol. 62, No. 2 (1974), 208-215.

(76) Allen, J. B. "Short-Term Spectral Analysis and Synthesis and
Modification by Discrete Fourier Transform." IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 3
(1977), 235-238.

(77) Allen, J. B., and L. R. Rabiner. "A Unified Theory of Short-Time
Spectrum Analysis and Synthesis." Proceedings of the IEEE,
Vol. 65, No. 11 (1977), 1558-1564.




(78) Dudley, H. "The Vocoder." Bell Labs Record, Vol. 17 (1939),
122-126.

(79) Jayant, N. S. "Waveform-Coding of Speech." Submitted for
publication, Journal of the Acoustical Society of India.

(80) Sambur, M. R. "An Efficient Linear Prediction Vocoder." Bell
System Technical Journal, Vol. 54, No. 10 (1975), 1693-1723.

(81) Zelinski, R., and P. Noll. "Adaptive Transform Coding of Speech
Signals." IEEE Transactions on Acoustics, Speech and Signal
Processing, Vol. ASSP-25, No. 4 (1977), 299-309.

(82) Huang, J. J., and P. M. Schultheiss. "Block Quantization of
Correlated Gaussian Random Variables." IEEE Transactions on
Communication System (1963), 291-296.

(83) Brown, J. L., Jr. "Mean Square Truncation Error in Series
Expansions of Random Functions." Journal of the Society of
Industrial and Applied Math, Vol. 8, No. 1 (1960), 28-32.

(84) Ahmed, N., T. Natarajan, and K. Rao. "Discrete Cosine Transform."
IEEE Transactions on Computers, Vol. C-23 (1974), 90-93.

(85) Esteban, D., and C. Galand. "Application of Quadrature Mirror
Filters to Split Band Voice Coding Schemes." International
Conference on Acoustics, Speech and Signal Processing Record
(1977), 191-195.

(86) French, N. R., and J. C. Steinberg. "Factors Governing the
Intelligibility of Speech Sounds." The Journal of the Acoustical
Society of America, Vol. 19, No. 1 (1947), 90-119.

(87) Beranek, Leo L. "The Design of Speech Communication Systems."
Proceedings of the IRE, Vol. 35 (1947), 880-890.

(88) Rabiner, L. R. "Synthesis by Rule." (Unpub. Ph.D. Dissertation,
Massachusetts Institute of Technology, 1968).

(89) Lieberman, P. Speech Acoustic and Perception. New York: Bobbs-
Merrill Company, Inc., 1972.

(90) Makhoul, J., and J. Wolf. Linear Prediction and the Spectral
Analysis. Cambridge: BBN Report No. 2304, 1972.

(91) Miller, G. A., and P. E. Nicely. "An Analysis of Perceptual
Confusions Among Some English Consonants." The Journal of the
Acoustical Society of America, Vol. 27, No. 2 (1955), 338-352.

(92) Magill, D. T., E. J. Craighill, D. W. Ellis, and C. K. Un. Speech
Digitization by LPC Estimation Techniques. Menlo Park,
California: Stanford Research Institute, ARPA, DDC-AD-A001931
and AD-785-738, 1974.




(93) Allen, J. B., and R. Yarlagadda. "Digital Poisson Summation
Formula and an Application." Submitted for publication, IEEE
Transactions on Acoustics, Speech and Signal Processing.

(94) Allen, J. B., and L. R. Rabiner. "Unbiased Spectral Estimation and
System Identification Using Short-Time Spectral Analysis
Methods." Submitted for publication, IEEE Transactions on
Acoustics, Speech and Signal Processing.

(95) Markel, J. D. "FFT Pruning." IEEE Transactions on Audio and
Electroacoustics, Vol. AU-19, No. 4 (1971), 305-311.

(96) Atal, B. S., and M. R. Schroeder. "Linear Prediction Analysis of
Speech Based on a Pole-Zero Representation." The Journal of
the Acoustical Society of America, Vol. 64, No. 5 (1978),
1310-1318.

(97) Elias, P. "Predictive Coding." IRE Transactions on Information
Theory, Parts 1 and 2 (1955), 16-33.

(98) Esteban, D., and C. Galand. "32 KBPS CCITT Compatible Split Band
Coding Scheme." International Conference on Acoustics, Speech
and Signal Processing Record (1978), 320-325.

(99) Noll, P. "A Comparative Study of Various Quantization Schemes for
Speech Encoding." Bell System Technical Journal, Vol. 54,
No. 9 (1975), 1597-1614.

(100) Wiggins, R. H. Formation and Solution of the Linear Equations Used
in Linear Predictive Coding. Bedford, Massachusetts: Mitre
Corporation, Report Electronic Systems Division, Report Nos.
MTR-2835, ESD-TR-74-301, 1974.

(101) Goldberg, A. J., R. L. Freudberg, and R. S. Cheung. Adaptive
Multilevel 16 KB/S Speech Coder. Needham Heights, Massachusetts:
GTE Sylvania, Report to Defense Communications Engineering
Center, Contract No. DCA 100-76-C-002, 1976.

(102) Goldberg, A. J., and H. L. Shaffer. "A Real-Time Adaptive
Predictive Coder Using Small Computers." IEEE Transactions on
Communications, Vol. COM-23, No. 12 (1975), 1443-1451.

(103) Qureshi, S. U. H., and G. D. Forney. "A 9.6/16 KB/S Speech
Digitizer." International Communication and Control Conference
Record (1975), 30-31 to 30-36.

(104) Qureshi, S. U. H., and G. D. Forney. Codex Speech Digitizer
Advanced Development Model. Newton, Massachusetts: CODEX
Corporation, Report to Defense Communications Engineering Center,
Contract No. DCA 100-76-C-0025, 1976.

(105) O'Neal, J. B., and R. W. Stroh. "Differential PCM for Speech and
Data Signals." IEEE Transactions on Communications, Vol. COM-20,
No. 5 (1972), 900-912.




(106) Burge, L. L., and R. Yarlagadda. "An Efficient Coding of the
Prediction Residual." International Conference on Acoustics,
Speech and Signal Processing Record (1979), 542-545.

(107) Ahmed, N., and K. R. Rao. Orthogonal Transformations for Digital
Processing. New York: Springer-Verlag, 1975.

(108) Oppenheim, A. V., and R. W. Schafer. Digital Signal Processing.
Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1975.

(109) Jayant, N. S. "Digital Coding of Speech Waveforms - PCM, DPCM and
DM Quantizers." Proceedings of the IEEE, Vol. 62, No. 5
(1974), 611-632.

(110) Tribolet, J. M., and R. E. Crochiere. "Frequency Domain Coding of
Speech." (Unpub. paper).

(111) Peterson, G. E., and H. L. Barney. "Control Methods Used in a
Study of the Vowels." The Journal of the Acoustical Society
of America, Vol. 24, No. 2 (1952), 175-184.

(112) Winkler, M. R. "High Information Delta Modulation." IEEE
International Convention Record, Part 8 (1963), 260-265.

(113) Ross, M., H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley.
"Average Magnitude Difference Function Pitch Extractor."
IEEE Transactions on Acoustics, Speech and Signal Processing,
Vol. ASSP-22, No. 5 (1974), 353-362.

(114) Newell, A., J. Barnett, and J. W. Forgie. Speech Understanding
System. Amsterdam: North-Holland Publishing Co., 1973.

(115) Papoulis, A. Probability, Random Variables and Stochastic
Processes. New York: McGraw-Hill Book Co., Inc., 1965.

(116) Oppenheim, A. V., ed. Applications of Digital Signal Processing.
Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1978.

(117) Kalman, R. E., and B. S. Bucy. "New Results in Linear Filtering
and Prediction Theory." Journal of Basic Engineering,
Transactions of the American Society of Mechanical Engineers
(1961), 95-108.

(118) Noll, A. M. "Clipstrum Pitch Determination." The Journal of the
Acoustical Society of America, Vol. 44, No. 6 (1968),
1585-1591.

(119) Itakura, F., and S. Saito. "Digital Filtering Techniques for
Speech Analysis and Synthesis." Seventh International
Congress on Acoustics Record (1971), 261-264.

(120) Noll, P. "Effects of Channel Errors on the Signal-to-Noise
Performance of Speech-Encoding Systems." Bell System Technical
Journal, Vol. 54, No. 9 (1975), 1615-1636.




(121) Papoulis, A. The Fourier Integral and Its Application. New York:
McGraw-Hill, 1962.

(122) Noll, P. "A Comparative Study of Various Quantization Schemes for
Speech Encoding." The Bell System Technical Journal, Vol. 54,
No. 9 (1975), 1597-1614.

(123) Stroh, R. W., and M. P. Paez. "A Comparison of Optimum and
Logarithmic Quantization for Speech PCM and DPCM Systems." IEEE
Transactions on Communications, Vol. COM-21, No. 6 (1973),
752-757.

(124) Paez, M. C., and Glisson, T. H. "Minimum Mean-Squared-Error
Quantization in Speech PCM and DPCM Systems." IEEE Transactions
on Communications, Vol. COM-20, No. 4 (1972), 225-230.

(125) Crochiere, R. E., L. R. Rabiner, N. S. Jayant, and J. M. Tribolet.
"A Study of Objective Measures for Speech Waveform Coders."
Proceedings of the Zurich Seminar on Digital Communications
(1978), 251-267.

(126) American Standards Association. American Standard for Preferred
Frequencies for Acoustical Measurement. New York, 1960.

(127) Strube, H. W. "Determination of the Instant of Glottal Closure
from the Speech Wave." The Journal of the Acoustical Society
of America, Vol. 56, No. 5 (1974), 1625-1629.

(128) Atal, B. S. "Automatic Speaker Recognition Based on Pitch
Contours." The Journal of the Acoustical Society of America,
Vol. 52 (1972), 1687-1697.

(129) Rosenberg, A. E., and M. R. Sambur. "New Techniques for Automatic
Speaker Verification." IEEE Transactions on Acoustics, Speech
and Signal Processing, Vol. ASSP-23 (1975), 169-176.

(130) Levitt, H. "Speech Processing Aids for the Deaf: An Overview."
IEEE Transactions on Audio and Electroacoustics, Vol. AU-21
(1973), 269-273.

(131) Crochiere, R. E. "An Analysis of 16 Kb/s Sub-Band Coder
Performance: Dynamic Range, Tandem Connections, and Channel
Errors." Bell System Technical Journal, Vol. 57, No. 8
(1978), 1059-1113.

(132) Beek, B., E. P. Neuberg, and D. C. Hadge. "An Assessment of the
Technology of Automatic Speech Recognition for Military
Applications." IEEE Transactions on Acoustics, Speech and Signal
Processing, Vol. ASSP-25, No. 4 (1977), 310-322.

(133) Tremain, T. E., J. W. Fussell, R. A. Dean, B. M. Abzug, M. D. Cowing,
and P. W. Boudra, Jr. "Implementation of Two Real-Time
Narrowband Speech Algorithms." International Conference on
Acoustics, Speech and Signal Processing Record (1976), 461-472.



(134) Crochiere, R. E. Personal Communication, August 7, 1978.

(135) Jayant, N. S. Personal Communication, August 7, 1978.

(136) Flanagan, J. L. Personal Communication, August 7, 1978.

(137) Esteban, D., and C. Galand. Personal Communication, August 11,
1978, and April 3, 1979.

(138) Markel, J. D. Personal Communication, February 21, 1979.

(139) Allen, J. B. Personal Communication, February 22, 1979.

(140) Sondhi, M. M. Personal Communication, April 23, 1979.

(141) Tremain, T. E. Personal Communication, June 18, 1979.

(142) Kang, G. S. Personal Communication, June 26, 1979.

(143) Aide, R. Personal Communication, May 16, 1979, and June 18, 1979.

(144) Klein, M. Personal Communication, May 16, 1979, and June 18, 1979.

(145) Sonderegger, R. Personal Communication, August 29, 1978.

(146) Belfield, W. Personal Communication, June 12, 1978.


APPENDIXES 




APPENDIX  A 

DEFINITIONS  RELATED  TO  SPEECH  SCIENCE 




1. Articulation Index - A weighted fraction representing, for a given
speech channel and voice condition, the effective proportion of the
normal speech signal which is available to a listener for conveying
speech intelligibility. It is computed from acoustical measurements
or estimates, at the ear of a listener, of the speech spectrum and of
the effective masking spectrum of any noise that may be present.

2. Allophone - A manifold acoustic variation of a phoneme.

3. Coding - The means by which an analog waveform is discretized and
then represented by one of the well-known methods, e.g., Pulse
Code Modulation.

4. Cognate - A complementary pair of fricatives. One is voiced, the
other is unvoiced; however, the place of articulation is the same.

5.  Consonant  -  Those  speech  sounds  which  are  not  exclusively  voiced  and 
mouth-radiated.  There  are  fricative,  stop  and  nasal  consonants. 

6. Communication - The means by which any transmission, emission or
reception of signs, usages or intelligence of any nature is conveyed.

7.  Excitation  Function  -  The  representation  of  the  glottis  in  the  vocal 
tract  by  mathematical  modeling  in  the  synthesis  of  voice. 

8.  Formant  -  The  resonance  component  of  a  speech  sound.  Generally,  it 
is  associated  with  the  phonetic  quality  of  a  vowel. 

9. Fricative - The speech sound produced by a noise excitation of the
vocal tract. This noise is generated by turbulent air flow at some
point along the constriction in the vocal tract. If the vocal cords
operate with the noise, the fricative will be voiced; otherwise, it
is unvoiced.

10. Glide (liquid) - Those sounds characterized by a gliding transition of
the vocal tract and influenced by the context in which the sound
occurs, commonly referred to as semi-vowels.

11. Glottis - The orifice between the vocal cords.

12. Intelligibility - The perceptual effect of understanding speech
sounds.

13. Language - The set of principles mastered by a speaker, which places
an infinite set of sentences within the speaker's grasp. It is a system
of human communication based on speech sounds used as arbitrary
symbols.

14.  Nasal  -  The  group  of  consonants  made  with  complete  closure  of  the 
mouth  making  the  radiation  sounds  come  from  the  nostrils. 

15.  Phoneme  -  The  basic  speech  sound  element  used  which  serves  to  keep 
words  apart. 

16. Plosives (Stops) - Those speech sounds which begin with complete
closure of the lips. The lungs build up pressure behind the closure
and then suddenly release it in an explosion, marking the voice
onset time.

17.  Pitch  -  The  difference  in  the  relative  vibration  frequency  of  the 
human  voice  that  contributes  to  the  total  meaning  of  speech,  the 
fundamental  frequency. 

18.  Quality  -  The  ability  to  identify  the  character  of  speech  sounds. 

19.  Place  of  Articulation  -  The  part  of  the  vocal  tract  where  constric¬ 
tion  occurs.  Three  places  of  articulation  are:  labial,  alveolar, 
palatal;  i.e.,  front,  middle,  and  back  of  the  mouth. 

20. Speech Perception - The ability of humans to discriminate and
differentiate speech sounds with their over-learned senses.

21. Suprasegmentals - The features of stress, pitch, intonation, melody,
etc., that occur simultaneously with speech sounds in an utterance.

22. Transitional Cues - The loci of frequency determined by the place of
articulation connecting phonemes.

23.  Unvoiced  -  Speech  sounds  that  occur  without  the  vocal  cord  source 
operating. 

24. Vocal Tract - The acoustic tube which is nonuniform in cross-sectional
area, beginning with the lips and ending with the vocal cords.
For the adult male, it averages 17 centimeters in length and varies
from 0 to 20 square centimeters in cross-section.

25. Voiced - Speech sounds that are produced by the vibratory action of
the vocal cords.

26. Voice Onset Time - The delay from complete closure of a plosive to
the beginning of voicing. It generally averages 25-30 milliseconds.

27. Vowel - Speech sounds produced exclusively by the vocal cord, i.e.,
voiced, excitation of the vocal tract.

The flow graph in Figure 24 gives all the programs used in this
thesis. These programs were coded for the INTERDATA 70. Each of the
modules is discussed below.

B.1 DIGITIZ

This program is an implementation of analog-to-digital (A/D)
conversion. It actuates the equipment and the A/D converter which is a part
of the computer system. The input is from an analog tape recorder. The
output corresponds to the quantized signal with amplitudes of ±10 volts
peak-to-peak in steps of 20 millivolts. This data is stored on disk in
area BURGE.DAT. The samples are grouped sequentially into 16 records of
256 samples each.
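
The quantization and record layout described above can be sketched in a few lines. This is an illustration only and not the INTERDATA 70 program: the function names `quantize` and `to_records` are hypothetical, and clipping at the converter range and zero-padding of a short final record are assumptions that the text does not state.

```python
STEP_VOLTS = 0.020   # 20 mV quantizer step, as stated for DIGITIZ
FULL_SCALE = 10.0    # +/-10 V input range, as stated for DIGITIZ
RECORD_LEN = 256     # samples per record, as stated for DIGITIZ


def quantize(volts):
    """Clip to the converter range (assumed) and round to the nearest step."""
    clipped = max(-FULL_SCALE, min(FULL_SCALE, volts))
    return round(clipped / STEP_VOLTS)  # integer code, one unit = 20 mV


def to_records(codes, record_len=RECORD_LEN):
    """Group codes into fixed-length records; zero-pad the last one (assumed)."""
    records = []
    for start in range(0, len(codes), record_len):
        rec = list(codes[start:start + record_len])
        rec += [0] * (record_len - len(rec))
        records.append(rec)
    return records
```

With these assumptions a full-scale input maps to code 500, and sixteen records hold 4096 samples.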

B.2 LOOK

This  program  operates  on  any  data  set.  It  was  developed  as  an  in¬ 
formation  tool  for  scanning  the  data.  It  has  an  option  to  have  the  out¬ 
put  on  a  CRT  or  on  a  line  printer. 

B.3 WINDOW

This routine applies a 256-point Hamming window to the data, shifts by
64 points, applies the 256-point window again, and continues this process
to the end of the file. The input is the sampled speech data, BURGE.DAT.
It conveniently informs the user that the sequence is being windowed. The
output is scaled, windowed data that is written in file WINDOW.DAT.
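
A minimal sketch of this overlapped windowing is given below, assuming the standard Hamming definition w(n) = 0.54 - 0.46 cos(2πn/(N-1)). The names `hamming` and `windowed_frames` are hypothetical, the output scaling mentioned above is not modelled, and dropping a trailing partial frame is an assumption.

```python
import math

FRAME_LEN = 256  # window length given for WINDOW
HOP = 64         # frame shift given for WINDOW


def hamming(n_points):
    """Standard Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def windowed_frames(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return successive windowed frames; a trailing partial frame is dropped."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * c for s, c in zip(frame, w)])
    return frames
```

With a 64-point shift, each sample away from the file edges is covered by four overlapping frames.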
B.4 AUTO

This program calculates the inverse filter coefficients,
cross-correlation coefficients, partial correlation coefficients, and
autocorrelation coefficients. The input is the windowed speech data. A
sample of each of the coefficients is printed. They are written in the
file AUTO.DAT.
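
One common way to obtain inverse filter and partial correlation coefficients from the autocorrelation of a windowed frame is the Levinson-Durbin recursion [61], [73]. The sketch below assumes that approach; whether AUTO uses exactly this recursion is not stated here. The function names are hypothetical, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with k_m equal to the last coefficient of the order-m filter is an assumption, and the cross-correlation output is not modelled.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return (a, k, err): inverse filter a[0..order] with a[0] = 1,
    partial correlation coefficients k[0..order-1], final error energy."""
    a = [1.0] + [0.0] * order
    k = [0.0] * order
    err = float(r[0])
    for m in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err
        k[m - 1] = km
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        err *= 1.0 - km * km
    return a, k, err
```

For an autocorrelation sequence of the form r[lag] = 0.5^lag, the recursion returns a first-order predictor and a zero second coefficient, as expected for a first-order process.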


B.5 INVERS and LATTIC

These routines are an implementation of the lattice filter. The
input is the windowed speech data. The output is the error signal or
the prediction residual. This output is written in the file RESIDUAL.DAT
on the disk.

Routine LATTIC provides the user the option of a plot of each frame
for the input speech and the prediction residual. The user must also
enter the two-character name of the sound for the frame desired.
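
A lattice inverse filter can be sketched as follows. The sketch assumes the usual forward and backward error recursions, f_m(n) = f_{m-1}(n) + k_m b_{m-1}(n-1) and b_m(n) = b_{m-1}(n-1) + k_m f_{m-1}(n), with the residual taken as the final forward error. The function name `lattice_residual` and the sign convention of the coefficients k are assumptions, not details taken from INVERS or LATTIC.

```python
def lattice_residual(x, k):
    """Run samples x through an all-zero lattice with coefficients k
    and return the forward prediction error (the prediction residual)."""
    order = len(k)
    b_delay = [0.0] * order  # b_delay[m] holds b_m(n-1)
    residual = []
    for sample in x:
        f = sample  # f_0(n)
        b = sample  # b_0(n)
        for m in range(order):
            b_prev = b_delay[m]        # b_m(n-1)
            b_delay[m] = b             # keep b_m(n) for the next sample
            f_next = f + k[m] * b_prev
            b = b_prev + k[m] * f      # b_{m+1}(n), uses the old f_m(n)
            f = f_next
        residual.append(f)
    return residual
```

For k = [-0.5, 0.25] the impulse response is 1, -0.625, 0.25, which matches the direct-form inverse filter obtained from the same coefficients by the step-up recursion.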

B.6 FFTMGR

This program calculates the Fourier spectrum of the speech input.
It calculates the energy per frame, splits this into the predetermined
sub-bands for sub-band energy, and it computes the normalization factor
and the bit allocation. It uses as input the residual signal and
outputs the spectrum, phase and bits. These are written in the files
SPECTM.DAT, PHAZ.DAT and BITS.DAT, respectively.
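
The sub-band energy step can be illustrated with the sketch below. The bit allocation rule shown, bits in proportion to each band's share of the frame energy, is a simple stand-in and not the allocation developed in the main part of the thesis. The band edges, the function names and the use of a direct DFT in place of an FFT are all assumptions made for the illustration.

```python
import cmath


def dft_power(frame):
    """Power spectrum |X(k)|^2 for k = 0..N/2, computed by a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def band_energies(power, edges):
    """Sum the power over each (low_bin, high_bin) band; high_bin is exclusive."""
    return [sum(power[lo:hi]) for lo, hi in edges]


def allocate_bits(energies, total_bits):
    """Stand-in rule: share total_bits in proportion to band energy."""
    total = sum(energies)
    if total == 0:
        return [0] * len(energies)
    ideal = [e / total * total_bits for e in energies]
    bits = [int(v) for v in ideal]
    leftover = total_bits - sum(bits)
    # Hand any remaining bits to the bands with the largest remainders.
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - bits[i], reverse=True)
    for i in by_remainder[:leftover]:
        bits[i] += 1
    return bits
```

The normalized energy share used here plays the role of a normalization factor only in a loose sense; the weighting actually used for perceptual reasons is described in the body of the thesis.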

B.7  ENCODE 

Routine ENCODE codes a signal based on bits allocated. It uses
uniform quantization using the adaptive strategy discussed in the main
part of the thesis to determine the number of levels and rounds the
individual samples to the nearest level. The input is the number of bits
and the sub-band signal. The coded signal is written in file CODE.DAT.
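
A uniform quantizer of the kind described can be sketched as below. The adaptive strategy itself is given in the main part of the thesis and is not reproduced here; this sketch assumes, purely for illustration, that the quantizer range is set from the peak magnitude of the sub-band frame. The function name and the returned peak value are hypothetical.

```python
import math


def encode(samples, bits):
    """Quantize samples to integer level indices 0 .. 2**bits - 1.

    The levels are spaced uniformly over [-peak, +peak], where peak is the
    largest magnitude in the frame (an assumed adaptation rule). Returns
    (codes, peak); the peak would be needed as side information.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if bits < 1 or peak == 0.0:
        return [0] * len(samples), peak
    levels = 2 ** bits
    step = 2.0 * peak / (levels - 1)
    codes = []
    for s in samples:
        index = int(math.floor((s + peak) / step + 0.5))   # round to nearest level
        codes.append(min(max(index, 0), levels - 1))        # clamp for safety
    return codes, peak
```

With 2 bits there are four levels, so a frame spanning -1 to +1 is mapped onto the indices 0 through 3.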

B.8  DECODE 

This routine decodes the integer data in the file CODE.DAT. It
determines  the  largest  code  level  and  calculates  the  allocated  bits  from 
this  level.  It  also  sets  a  maximum  quantization  level.  This  decoded 
signal  is  written  in  the  file  SIGNAL.DAT. 
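
The decoding step described above, in which the bit count is inferred from the largest code level, can be sketched as follows. The function name is hypothetical, and the peak argument stands in for the maximum quantization level that the routine sets; how that level is chosen is an assumption here. The levels are taken to be uniformly spaced over [-peak, +peak].

```python
def decode(codes, peak):
    """Reconstruct sample values from integer level indices.

    The number of bits is inferred from the largest code, as described
    for DECODE: a largest code of 3 implies 2 bits, 7 implies 3 bits.
    Returns (samples, bits).
    """
    max_code = max(codes, default=0)
    bits = max(1, max_code.bit_length())     # bits implied by the largest level
    levels = 2 ** bits
    step = 2.0 * peak / (levels - 1)
    return [c * step - peak for c in codes], bits
```

One limitation of inferring the bit count this way is that a frame whose codes never reach the top level would be decoded with too few bits, so a practical scheme would need the allocation itself as side information.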

B.9 SUMLPD

This routine computes the sub-band prediction residuals using the
digital bandpass filters, modulator, lowpass filters, and decimator.
The inputs are the signal spectrum and phase. The outputs are the
decimated sub-bands. These are written respectively in the files
SBAND1.DAT, SBAND2.DAT, SBAND3.DAT, and SBAND4.DAT.

B.10 RESULT

This routine computes signal-to-noise ratios (SNR). It uses as input
the decoded signal and the residual signal. The output is a normalized
SNR and an average mean squared SNR. The user has the option of
producing a plot; if a plot is requested, the two-character sound names
must be entered. The data are written in the file SQNR.DAT.

B.11 SNRCAL

This routine calculates the SNR from any two data arrays. The input is
two arrays of 256 points or fewer. The program outputs a mean-squared
SNR, a root-mean-square (RMS) SNR and a conventional (normalized) SNR.
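
As an illustration, a conventional SNR between a reference array and a processed array can be computed as below. The definition used, signal energy over error energy in decibels, is the usual one and is assumed here; the mean-squared and RMS variants are not sketched because their definitions are not restated in this appendix. The function name is hypothetical.

```python
import math


def snr_db(reference, processed, max_points=256):
    """Conventional SNR in dB: 10 log10(sum(s^2) / sum((s - r)^2)).

    Only the first max_points samples common to both arrays are used,
    mirroring the 256-point limit described for SNRCAL.
    """
    n = min(len(reference), len(processed), max_points)
    signal = sum(reference[i] ** 2 for i in range(n))
    noise = sum((reference[i] - processed[i]) ** 2 for i in range(n))
    if noise == 0:
        return math.inf        # identical arrays: no measurable noise
    if signal == 0:
        return -math.inf       # silent reference
    return 10.0 * math.log10(signal / noise)
```

An error one tenth the size of the signal on every sample corresponds to an energy ratio of 100, that is, 20 dB.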


B.12 PITCH

This routine estimates the fundamental frequency of a speech
utterance. The inputs are the speech-array prediction residual signal
and the spectrum of the signal. The program outputs the pitch.
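
The pitch extraction technique of the thesis is described in the main text and is not reproduced here. As a stand-in, the sketch below uses a plain autocorrelation peak picker applied to the prediction residual, which is a standard method and not necessarily the one used in PITCH. It ignores the spectrum input, and the sampling rate, search range, and function name are assumptions.

```python
def estimate_pitch(residual, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency in Hz from a residual frame.

    Searches lags corresponding to f_max down to f_min and returns the
    frequency of the lag with the largest autocorrelation. Returns 0.0
    when no positive peak is found (for example, an unvoiced frame).
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(residual) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(residual[i] * residual[i + lag]
                    for i in range(len(residual) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0


# Usage: an impulse train with a period of 80 samples at an assumed 8000 Hz rate.
train = [1.0 if i % 80 == 0 else 0.0 for i in range(400)]
pitch = estimate_pitch(train, 8000.0)
```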

ARTICULATION INDEX

The concept of the Articulation Index (AI) was advanced by French
and Steinberg [86]. It is defined as a number obtained from articulation
tests using nonsense syllables under the assumption that any narrow band
of speech frequencies of a given intensity in the absence of noise
carries a contribution to the total index which is independent of the
other bands with which it is associated, and that the total for all the
bands is the sum of the contributions of the separate bands [86]. It
must be proven that there is a unique function relating syllable or word
articulation to AI for any given articulation crew and choice of word
list. In determining AI under these conditions, there are essentially
two parameters of a linear communication system that can be varied:
(a) the level of the speech above the threshold of hearing, and (b) the
frequency response of the system. Here a linear system that is free from
noise is assumed.

A curve of AI versus frequency is included from French and Steinberg
[86] in Figure 32. The curve is derived from the syllable articulation
gain and frequency responses of speech waveforms [86]. The syllable
articulation is expressed as the percentage of meaningless
consonant-vowel-consonant monosyllables that are perceived correctly.

Figure 32. Composite Articulation Index vs. Cutoff Frequency of Ideal
Lowpass Filters (After French and Steinberg, 1947)

Beranek [87] pointed out two important facts. First, extending the
frequency range of a communication system below 200 or above 6000 Hz
contributes almost nothing to the intelligibility of speech. Second,
each frequency band shown in Table XXV makes a 5 percent contribution to
the AI, provided that the orthotelephonic gain of the system is optimal
(about +10 dB) and that there is no noise present.


TABLE XXV

FREQUENCY BANDS OF EQUAL CONTRIBUTION
TO ARTICULATION INDEX

No.  Edges of Band (Hz)  Mid-Freq (Mean) (Hz)    No.  Edges of Band (Hz)  Mid-Freq (Mean) (Hz)

  1      200 -  330               270             11     1660 - 1830              1740
  2      330 -  430               380             12     1830 - 2020              1920
  3      430 -  560               490             13     2020 - 2240              2130
  4      560 -  700               630             14     2240 - 2500              2370
  5      700 -  840               770             15     2500 - 2820              2660
  6      840 - 1000               920             16     2820 - 3200              3000
  7     1000 - 1150              1070             17     3200 - 3650              3400
  8     1150 - 1310              1230             18     3650 - 4250              3950
  9     1310 - 1480              1400             19     4250 - 5050              4650
 10     1480 - 1660              1570             20     5050 - 6100              5600

The orthotelephonic (OT) gain is defined by

\[
\text{OT Gain (Subjective)} = 20 \log (e_0/p_0) + 20 \log (e_2/e_0)
  + 20 \log (p_1/e_2) \tag{A.1}
\]

where

$p_1$ = free field pressure necessary to produce the same loudness
in the ear as was produced by the earphone with voltage
$e_2$ across it,

$e_0$ = voltage produced by the microphone across the input
resistor of the amplifier by a voice which produces pressure
$p_0$ at a distance of one meter in a free field,

$e_2/e_0$ = voltage amplification of the amplifier.

\[
\text{OT Gain (Objective)} = 20 \log (e_0/p_0) + 20 \log R
  + 20 \log (e_2/e_0) + 20 \log (p_e/e_2) \tag{A.2}
\]

where

$R$ = ratio of the pressure produced at the eardrum of a listener
by a source of sound to the pressure which would be produced
by the same source at the listener's head position if he were
removed from the field,

$p_e$ = pressure produced at the eardrum of a listener by the
earphone with a voltage $e_2$ across it; the other symbols are
the same as in (A.1).
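
As a short worked step, assuming the logarithms are to base 10 so that the gains are in decibels, the terms of each definition combine into a single pressure ratio because the intermediate voltages cancel. Here $p_0$ and $p_1$ are the free field pressures, $e_0$ and $e_2$ the amplifier input and output voltages, $R$ the eardrum-to-free-field pressure ratio, and $p_e$ the eardrum pressure produced by the earphone.

```latex
\begin{align*}
\text{OT Gain (Subjective)}
  &= 20 \log \frac{e_0}{p_0} + 20 \log \frac{e_2}{e_0} + 20 \log \frac{p_1}{e_2} \\
  &= 20 \log \left( \frac{e_0}{p_0} \cdot \frac{e_2}{e_0} \cdot \frac{p_1}{e_2} \right)
   = 20 \log \frac{p_1}{p_0}, \\
\text{OT Gain (Objective)}
  &= 20 \log \left( \frac{e_0}{p_0} \cdot R \cdot \frac{e_2}{e_0} \cdot \frac{p_e}{e_2} \right)
   = 20 \log \frac{R \, p_e}{p_0}.
\end{align*}
```

Under this reading, an optimal orthotelephonic gain of about +10 dB corresponds to a pressure ratio $p_1/p_0$ of about 3.16.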

The AI contributions obtained per frequency band in Table XXV are
added successively to arrive at the total AI.
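
The summation can be illustrated with a small sketch. It assumes the idealized conditions stated earlier (optimal gain, no noise), so that every band of Table XXV lying wholly inside the passband of the system contributes its full 0.05 and every other band contributes nothing. The function name and this all-or-nothing treatment of partly covered bands are simplifying assumptions.

```python
# Band edges in Hz, taken from Table XXV (20 bands of equal contribution).
BANDS = [
    (200, 330), (330, 430), (430, 560), (560, 700), (700, 840),
    (840, 1000), (1000, 1150), (1150, 1310), (1310, 1480), (1480, 1660),
    (1660, 1830), (1830, 2020), (2020, 2240), (2240, 2500), (2500, 2820),
    (2820, 3200), (3200, 3650), (3650, 4250), (4250, 5050), (5050, 6100),
]
CONTRIBUTION = 0.05   # each band adds 5 percent under ideal conditions


def articulation_index(f_low, f_high):
    """Total AI for an ideal system passing frequencies f_low .. f_high (Hz)."""
    passed = sum(1 for low, high in BANDS if low >= f_low and high <= f_high)
    return passed * CONTRIBUTION
```

For example, an ideal lowpass system with a cutoff at 1660 Hz passes the first ten bands, giving an AI of about 0.5.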




APPENDIX  D 

SONAGRAMS 



VITA


Legand L. Burge, Jr.
Candidate for the Degree of
Doctor of Philosophy


Thesis: EFFICIENT CODING OF THE PREDICTION RESIDUAL
Major Field: Electrical Engineering
Biographical:


Personal Data: Born in Oklahoma City, Oklahoma, August 3, 1949,
the son of Mr. and Mrs. L. L. Burge.

Education: Graduated from Douglass High School, Oklahoma City,
Oklahoma, in May 1967; received Bachelor of Science degree in
Electrical Engineering from Oklahoma State University in 1972;
received Master of Science in Electrical Engineering from
Oklahoma State University in 1973; completed requirements for
the Doctor of Philosophy degree at Oklahoma State University
in December 1979.

Professional Experience: Engineer Trainee, Oklahoma Gas and
Electric, summers 1969-71; student teaching assistant, School of
Electrical Engineering, Oklahoma State University, 1970-71;
graduate teaching assistant, School of Electrical Engineering,
Oklahoma State University, 1971-72; United States Air Force,
1973 to present.

Professional Organizations: Member of Institute of Electrical and
Electronics Engineers, Sigma Xi, Eta Kappa Nu.




